Dissertations / Theses on the topic 'Natural language processing techniques'

To see the other types of publications on this topic, follow the link: Natural language processing techniques.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 dissertations / theses for your research on the topic 'Natural language processing techniques.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Cosh, Kenneth John. "Supporting organisational semiotics with natural language processing techniques." Thesis, Lancaster University, 2003. http://eprints.lancs.ac.uk/12351/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Harmain, H. M. "Building object-oriented conceptual models using natural language processing techniques." Thesis, University of Sheffield, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.312740.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Califf, Mary Elaine. "Relational learning techniques for natural language information extraction." Digital version accessible at http://wwwlib.umi.com/cr/utexas/main, 1998.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Eyecioglu, Ozmutlu Asli. "Paraphrase identification using knowledge-lean techniques." Thesis, University of Sussex, 2016. http://sro.sussex.ac.uk/id/eprint/65497/.

Full text
Abstract:
This research addresses the problem of identifying sentential paraphrases; that is, the ability of an estimator to predict reliably whether two sentential text fragments are paraphrases. The paraphrase identification task has practical importance in the Natural Language Processing (NLP) community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods for identifying paraphrases should help to improve the performance of NLP systems that require language understanding, including key applications such as machine translation, information retrieval and question answering, among others. Over the course of the last decade, a growing body of research has been conducted on paraphrase identification, and it has become an individual working area of NLP. Our objective is to investigate whether techniques that concentrate on automated understanding of text while requiring fewer resources can achieve results comparable to methods employing more sophisticated NLP processing tools and other resources. These techniques, which we call "knowledge-lean", range from simple, shallow overlap methods based on lexical items or n-grams to more sophisticated methods that employ automatically generated distributional thesauri. The work begins by focusing on techniques that exploit lexical overlap and on text-based statistical techniques that are much less dependent on NLP tools. We investigate the question "To what extent can these methods be used for a paraphrase identification task?" On two gold-standard datasets, we obtained competitive results on the Microsoft Research Paraphrase Corpus (MSRPC) and state-of-the-art results on the Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with support vector machines (SVMs).
These techniques do not require any language-specific tools or external resources, and appear to perform well without the need to normalise colloquial language such as that found on Twitter. It was natural to extend the scope of the research and to experiment on another language that is poor in resources. The scarcity of available paraphrase data led us to construct our own corpus: we built a paraphrase corpus in Turkish. This corpus is relatively small but provides a representative collection that includes a variety of texts. While there is still debate as to whether a binary or fine-grained judgement best suits a paraphrase corpus, we chose to provide data for a sentential textual similarity task using fine-grained scoring, knowing that fine-grained scores can be converted to binary scores, but not the other way around. The correlation between the results from the different corpora is promising; it can therefore be surmised that languages poor in resources can benefit from knowledge-lean techniques. Discovering the strengths of knowledge-lean techniques led us to extend the work with a new perspective on techniques that use distributional statistical features of text by representing each word as a vector (word2vec). While recent research applies word2vec to larger fragments of text, such as phrases, sentences and even paragraphs, a new approach is presented here that introduces vectors of character n-grams carrying the same attributes as word vectors. The proposed method can capture syntactic as well as semantic relations without semantic knowledge, and proves competitive on Twitter compared with more sophisticated methods.
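The knowledge-lean overlap features described in this abstract can be sketched in a few lines. The function names and the precision/recall/F1 formulation below are illustrative assumptions, not the thesis's exact feature set; in the thesis, features of this kind were fed to an SVM classifier rather than used directly.

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams in a lower-cased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_features(s1, s2, n=3):
    """Precision, recall and F1 of character n-gram overlap between two
    sentences -- the knowledge-lean kind of feature an SVM could consume."""
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    shared = a & b
    p = len(shared) / len(a) if a else 0.0
    r = len(shared) / len(b) if b else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Identical sentences score an F1 of 1.0 and unrelated ones near 0.0; no parser, lemmatiser or language-specific resource is involved, which is the point of the knowledge-lean approach.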
APA, Harvard, Vancouver, ISO, and other styles
5

Chong, Man Yan Miranda. "A study on plagiarism detection and plagiarism direction identification using natural language processing techniques." Thesis, University of Wolverhampton, 2013. http://hdl.handle.net/2436/298219.

Full text
Abstract:
Ever since we entered the digital communication era, the ease of information sharing through the internet has encouraged online literature searching. With this comes the potential risk of a rise in academic misconduct and intellectual property theft. As concerns over plagiarism grow, more attention has been directed towards automatic plagiarism detection. This is a computational approach which assists humans in judging whether pieces of text are plagiarised. However, most existing plagiarism detection approaches are limited to superficial, brute-force string-matching techniques. If the text has undergone substantial semantic and syntactic changes, string-matching approaches do not perform well. In order to identify such changes, linguistic techniques which are able to perform a deeper analysis of the text are needed. To date, very limited research has been conducted on utilising linguistic techniques in plagiarism detection. This thesis provides novel perspectives on the plagiarism detection and plagiarism direction identification tasks. The hypothesis is that original texts and rewritten texts exhibit significant but measurable differences, and that these differences can be captured through statistical and linguistic indicators. To investigate this hypothesis, four main research objectives are defined. First, a novel framework for plagiarism detection is proposed. It involves the use of Natural Language Processing techniques, rather than only relying on the traditional string-matching approaches. The objective is to investigate and evaluate the influence of text pre-processing, and of statistical, shallow and deep linguistic techniques, using a corpus-based approach. This is achieved by evaluating the techniques in two main experimental settings. Second, the role of machine learning in this novel framework is investigated. The objective is to determine whether the application of machine learning in the plagiarism detection task is helpful.
This is achieved by comparing a threshold-setting approach against a supervised machine learning classifier. Third, the prospect of applying the proposed framework in a large-scale scenario is explored. The objective is to investigate the scalability of the proposed framework and algorithms. This is achieved by experimenting with a large-scale corpus in three stages. The first two stages are based on longer text lengths and the final stage is based on segments of texts. Finally, the plagiarism direction identification problem is explored as supervised machine learning classification and ranking tasks. Statistical and linguistic features are investigated individually or in various combinations. The objective is to introduce a new perspective on the traditional brute-force pair-wise comparison of texts. Instead of comparing original texts against rewritten texts, features are drawn from traits of texts to build a pattern for original and rewritten texts. Thus, the classification or ranking task is to fit a piece of text into a pattern. The framework is tested by empirical experiments, and the results from initial experiments show that deep linguistic analysis contributes to solving the problems we address in this thesis. Further experiments show that combining shallow and deep techniques helps improve the classification of plagiarised texts by reducing the number of false negatives. In addition, the experiment on plagiarism direction detection shows that rewritten texts can be identified by statistical and linguistic traits. The conclusions of this study offer ideas for further research directions and potential applications to tackle the challenges that lie ahead in detecting text reuse.
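As a point of contrast for the deeper linguistic techniques this thesis argues for, the brute-force string-matching baseline it criticises might look like the sketch below. The helper names, the Jaccard formulation and the threshold value are hypothetical, chosen only to illustrate why such a baseline fails once a text is substantially rewritten.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams -- the unit of a brute-force text fingerprint."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def looks_plagiarised(text, source, n=3, threshold=0.5):
    """Flag a text whose n-gram fingerprint heavily overlaps a source."""
    return jaccard(word_ngrams(text, n), word_ngrams(source, n)) >= threshold
```

A sentence rewritten with synonyms shares no word n-grams with its source, so this detector misses it entirely; that is the gap the statistical and linguistic indicators in the thesis aim to close.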
APA, Harvard, Vancouver, ISO, and other styles
6

Al, Qady Mohammed Abdelrahman. "Concept relation extraction using natural language processing: the CRISP technique." [Ames, Iowa: Iowa State University], 2008.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
7

Björner, Amanda. "Natural Language Processing techniques for feedback on text improvement : A qualitative study on press releases." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301303.

Full text
Abstract:
Press releases play a key role in today's news production by being public statements of newsworthy content that function as a pre-formulation of news. Press releases originate from a wide range of actors, and a common goal is for them to reach a high societal impact. This thesis examines how Natural Language Processing (NLP) techniques can be successful in giving feedback to press release authors that helps enhance the content and quality of their texts. This could, in turn, contribute to increased impact. To examine this, the research question is divided into two parts. The first part examines how content-perception feedback can contribute to improving press releases. This is examined through the development of a web tool where user-written press releases are analyzed. The analysis consists of a readability assessment using the LIX metric and linguistic bias detection of weasel words and peacock words through rule-based sentiment analysis. The user experiences and opinions are evaluated through an online questionnaire and semi-structured interviews. The second part of the research question examines how trending-topic information can contribute to improving press releases. This part is examined theoretically, based on a literature review of state-of-the-art methods, and qualitatively, by gathering opinions from press release authors in the previously mentioned questionnaire and interviews. Based on the results, it is identified that, for content-perception feedback, it is especially less experienced authors and scientific content aimed at the general public that would achieve improved text quality from objective readability assessment and detection of biased expressions. Nevertheless, most of the evaluation participants were more satisfied with their press releases after editing based on the readability feedback, and all participants with biased words in their texts reported that the detection led to positive changes resulting in improved text quality.
As for the theoretical part, it is considered that both text quality and the number of publications increase when writing about trending topics. Giving authors trending-topic information at a detailed level is indicated to be the most helpful.
Actors ranging from private companies to public authorities and researchers use press releases to publicly share information with news value. These press releases then play a key role in today's news production by pre-formulating news, and therefore aim to maintain a certain linguistic standard. To improve the quality and content of press releases, this thesis examines how language-technology text analysis and feedback can support authors in improving their texts. This question is examined in two parts, one applied and one theoretical. The applied part examines how feedback on content perception can improve press releases. A web-based tool was developed in which users can enter press releases and have them analyzed. The analysis is based on readability, assessed with the LIX metric, and on linguistic bias in the form of weasel words and peacock words, detected through rule-based sentiment analysis. This part was evaluated qualitatively through a user questionnaire and in-depth interviews. The theoretical part of the question examines how information about trending topics can contribute to improving press releases. It was carried out as a literature review and evaluated qualitatively by compiling opinions from professionals who work with press releases, gathered in the questionnaire and interviews described above. The results indicate that, for feedback on content perception, it is especially less experienced authors and scientific content aimed at the general public that would achieve improved text quality from readability assessment and detection of biased expressions. At the same time, a majority of the evaluation participants were more satisfied with their press releases after editing based on the readability feedback.
In addition, all participants with biased expressions in their texts reported that the detection led to positive changes resulting in improved text quality. As for the theoretical part, both text quality and the number of publications are considered to increase for press releases about trending topics. Giving authors detailed information about trending topics is indicated to be the most helpful.
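The LIX readability metric used in the web tool has a standard definition: average sentence length plus the percentage of long words (more than six letters). A minimal sketch, with a hypothetical regex-based tokeniser standing in for whatever the tool actually uses:

```python
import re

def lix(text):
    """LIX readability: average sentence length plus the percentage of
    long words (words of more than six letters)."""
    words = re.findall(r"[A-Za-zÅÄÖåäö]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)
```

LIX values are commonly interpreted on a scale where scores below roughly 30 indicate easy text and scores above 60 very difficult text, which is what makes the metric usable as author feedback.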
APA, Harvard, Vancouver, ISO, and other styles
8

Peri, Deepthi. "Applying Natural Language Processing and Deep Learning Techniques for Raga Recognition in Indian Classical Music." Thesis, Virginia Tech, 2020. http://hdl.handle.net/10919/99967.

Full text
Abstract:
In Indian Classical Music (ICM), the Raga is a musical piece's melodic framework. It encompasses the characteristics of a scale, a mode, and a tune, with none of them fully describing it, rendering the Raga a unique concept in ICM. The Raga provides musicians with a melodic fabric, within which all compositions and improvisations must take place. Identifying and categorizing the Raga is challenging due to its dynamism and complex structure, as well as the polyphonic nature of ICM. Hence, Raga recognition, that is, identifying the constituent Raga in an audio file, has become an important problem in music informatics with several known prior approaches. Advancing the state of the art in Raga recognition paves the way to improving other Music Information Retrieval tasks in ICM, including transcribing notes automatically, recommending music, and organizing large databases. This thesis presents a novel melodic pattern-based approach to recognizing Ragas by representing this task as a document classification problem, solved by applying a deep learning technique. A digital audio excerpt is hierarchically processed and split into subsequences and gamaka sequences to mimic a textual document structure, so our model can learn the resulting tonal and temporal sequence patterns using a Recurrent Neural Network. Although trained and tested on these smaller sequences, the model predicts the Raga for the entire audio excerpt, with an accuracy of 90.3% on the Carnatic Music Dataset and 95.6% on the Hindustani Music Dataset, thus outperforming prior approaches in Raga recognition.
Master of Science
In Indian Classical Music (ICM), the Raga is a musical piece's melodic framework. The Raga is a unique concept in ICM, not fully described by any of the fundamental concepts of Western classical music. The Raga provides musicians with a melodic fabric, within which all compositions and improvisations must take place. Raga recognition refers to identifying the constituent Raga in an audio file, a challenging and important problem with several known prior approaches and applications in Music Information Retrieval. This thesis presents a novel approach to recognizing Ragas by representing this task as a document classification problem, solved by applying a deep learning technique. A digital audio excerpt is processed into a textual document structure, from which the constituent Raga is learned. Based on the evaluation with third-party datasets, our recognition approach achieves high accuracy, thus outperforming prior approaches.
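The idea of treating an audio excerpt as a "document" can be illustrated very loosely: chunk a symbolic pitch sequence into fixed-length tokens that play the role of words, which a document classifier can then consume. This is only a toy analogue; the thesis's hierarchical splitting into subsequences and gamaka sequences is considerably more involved, and the function and parameter names here are invented.

```python
def to_document(pitches, token_len=4):
    """Chunk a symbolic pitch sequence into fixed-length string tokens so
    that an excerpt can be handled like a sequence of 'words'."""
    return ["".join(pitches[i:i + token_len])
            for i in range(0, len(pitches), token_len)]
```

Once an excerpt is a token sequence, any sequence model (such as the Recurrent Neural Network used in the thesis) can be trained on it exactly as on text.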
APA, Harvard, Vancouver, ISO, and other styles
9

Imperatore, Gennaro. "Improving ease and speed of use of mobile augmentative and alternative communication systems through the use of natural language processing and natural language generation techniques." Thesis, University of Strathclyde, 2016. http://digitool.lib.strath.ac.uk:80/R/?func=dbin-jump-full&object_id=27381.

Full text
Abstract:
Communication is recognised as a human right by the United Nations. Currently, there are millions of people who, for a variety of reasons, cannot communicate comfortably. For example, in the UK alone there are 250,000 people who, as a result of a stroke, are now unable to communicate and are affected by a condition known as Aphasia. These people are said to have Complex Communication Needs. With the proliferation of smart devices like tablets and smartphones, people with Complex Communication Needs are discovering the assistive potential of these devices, either to aid the recuperation of their communicative abilities or to assist them in their daily lives. These systems are called Augmentative and Alternative Communication (AAC) systems. However, current AAC systems are cumbersome to use, and users require a long time to form sentences, with the result that they cannot communicate confidently and are therefore left isolated and frustrated. Even though much work has been done in the area, these systems remain slow and communication is not effective. This thesis investigates whether the inclusion of Natural Language Processing and Natural Language Generation techniques in Augmentative and Alternative Communication systems on mobile devices can improve the ease and speed of use for users with Complex Communication Needs, by implementing "Dictum", an AAC app which makes use of NLG/NLP techniques. The work followed the approach of Action Research, in which the target users help the investigator by identifying the problem, sanctioning the research and evaluating the results. Users were therefore actively involved in the design of the application from the very start and gave feedback after each iteration, leading to the final application. This work has found that the inclusion of NLP and NLG in AAC does indeed improve ease and speed of use when compared to popular apps available today.
Dictum improves speed by doing two things: reducing the search space of words by suggesting words relevant to the last word inserted, using a semantic network of nouns, and allowing the user to build sentences by selecting key words only, delegating the responsibility of sentence formation to the application itself. In addition, during the course of this work, an effective mechanism for capturing requirements from users with Complex Communication Needs was discovered by looking at how users adapt the functionality of their devices. The app was evaluated both quantitatively, by computing keystroke savings and evaluating the interface using well-established HCI laws, and qualitatively, by asking for feedback from potential users and Speech and Language Therapists, following the Action Research practice of involving those touched by the problem.
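Keystroke savings, one of the quantitative measures mentioned, is conventionally the percentage of characters the user did not have to type. The sketch below pairs that formula with a toy semantic-network lookup of the kind Dictum uses to narrow word choices; the `RELATED` data and all names are invented for illustration.

```python
def keystroke_savings(target, keystrokes_used):
    """Percentage of characters of the target the user did not have to type."""
    return 100.0 * (len(target) - keystrokes_used) / len(target)

# Toy semantic network of nouns: invented data, standing in for the
# resource used to narrow the candidate words after each selection.
RELATED = {
    "coffee": ["milk", "sugar", "cup"],
    "park": ["bench", "tree", "walk"],
}

def suggestions(last_word):
    """Words judged relevant to the last word the user inserted."""
    return RELATED.get(last_word.lower(), [])
```

If a ten-character message is produced in four key presses, the savings are 60%; offering semantically related words after each selection is what drives the keystroke count down.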
APA, Harvard, Vancouver, ISO, and other styles
10

Antici, Francesco. "Advanced techniques for cross-language annotation projection in legal texts." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/23884/.

Full text
Abstract:
Nowadays, the majority of the services we benefit from are provided online, and their use is regulated by the users' acceptance of the terms of service. All our data are handled in accordance with the clauses of such documents, and all our behaviour must comply with them. Given this, it would be very useful to find automated techniques to ensure the fairness of such a document or to inform users about possible threats. The focus of this work is to create resources aimed at the development of such tools in languages other than English, which may lack linguistic resources and annotated corpora. The enormous breakthroughs of recent years in Natural Language Processing techniques have made it possible to create such tools through automated and unsupervised processes. One of the means to achieve this is annotation projection between two parallel corpora. The difficulty and cost of creating ad hoc resources for every language motivate the need to find another way of achieving the goal. This work investigates a cross-language annotation projection technique based on sentence embeddings and similarity metrics to find matches between sentences. Several combinations of methods and algorithms are compared, among which are monolingual and multilingual embedding neural models. The experiments are conducted on two datasets, where the reference language is always English and the projections are evaluated on Italian, German and Polish. The results obtained provide a robust and reliable technique for the task and a good starting point for building multilingual tools.
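The projection step can be sketched as nearest-neighbour matching under cosine similarity between sentence embeddings: each labelled source sentence passes its label to the most similar target sentence. The function names, the toy two-dimensional vectors and the threshold are assumptions for illustration; real embeddings would come from a monolingual or multilingual neural model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def project_annotations(src_vecs, tgt_vecs, src_labels, threshold=0.5):
    """Copy each labelled source sentence's label onto the most similar
    target sentence, if the similarity clears the threshold."""
    projected = {}
    for i, sv in enumerate(src_vecs):
        sims = [cosine(sv, tv) for tv in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            projected[j] = src_labels[i]
    return projected
```

The threshold is what keeps a poor alignment from projecting a label onto an unrelated sentence, trading coverage for precision.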
APA, Harvard, Vancouver, ISO, and other styles
11

Kiang, Kai-Ming (Mechanical & Manufacturing Engineering, Faculty of Engineering, UNSW). "Natural feature extraction as a front end for simultaneous localization and mapping." Awarded by: University of New South Wales, School of Mechanical and Manufacturing Engineering, 2006. http://handle.unsw.edu.au/1959.4/26960.

Full text
Abstract:
This thesis is concerned with algorithms for finding natural features that are then used for simultaneous localisation and mapping, commonly known as SLAM in navigation theory. The task involves capturing raw sensory inputs, extracting features from these inputs, and using the features for mapping and localising during navigation. The ability to extract natural features allows automatons such as robots to be sent to environments that no human being has previously explored, working in a way that is similar to how human beings understand and remember where they have been. In extracting natural features from images, the way that features are represented and matched is a critical issue, in that the computation involved could be wasted if the wrong method is chosen. While there are many techniques capable of matching pre-defined objects correctly, few of them can be used for real-time navigation in an unexplored environment, intelligently deciding on what is a relevant feature in the images. Normally, feature analysis that extracts relevant features from an image is a two-step process, the steps being firstly to select interest points and then to represent these points based on the local region properties. A novel technique is presented in this thesis for extracting a set of natural features that is small enough yet robust enough for navigation purposes. The technique involves a three-step approach. The first step is an interest point selection method based on extrema of differences of Gaussians (DoG). The second step applies Textural Feature Analysis (TFA) to the local regions of the interest points. The third step selects the distinctive features using Distinctness Analysis (DA), based mainly on the probability of occurrence of the features extracted. The additional step of DA has shown that a significant improvement in processing speed is attained over previous methods.
Moreover, TFA / DA has been applied in a SLAM configuration that is looking at an underwater environment where texture can be rich in natural features. The results demonstrated that an improvement in loop closure ability is attained compared to traditional SLAM methods. This suggests that real-time navigation in unexplored environments using natural features could now be a more plausible option.
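The first step, selecting interest points at extrema of a difference of Gaussians, can be illustrated in one dimension. A real implementation works on 2-D images across multiple scales, so this stdlib sketch is only a simplified analogue with invented parameter defaults.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian kernel."""
    k = [math.exp(-(x * x) / (2 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def convolve(signal, kernel):
    """Convolve a signal with a kernel, replicating the border samples."""
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - r, 0), len(signal) - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def dog_extrema(signal, s1=1.0, s2=2.0, radius=4):
    """Indices where the difference of two Gaussian blurs has a local
    extremum -- candidate interest points."""
    a = convolve(signal, gaussian_kernel(s1, radius))
    b = convolve(signal, gaussian_kernel(s2, radius))
    d = [x - y for x, y in zip(a, b)]
    return [i for i in range(1, len(d) - 1)
            if (d[i] > d[i - 1] and d[i] > d[i + 1])
            or (d[i] < d[i - 1] and d[i] < d[i + 1])]
```

A sharp spike in the signal produces a strong DoG response at its position, which is why DoG extrema make good candidates for distinctive, repeatable features.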
APA, Harvard, Vancouver, ISO, and other styles
12

Bhaduri, Sreyoshi. "NLP in Engineering Education - Demonstrating the use of Natural Language Processing Techniques for Use in Engineering Education Classrooms and Research." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/82202.

Full text
Abstract:
Engineering Education is a developing field, with new research and ideas constantly emerging and contributing to the ever-evolving nature of this discipline. Textual data (such as publications, open-ended questions on student assignments, and interview transcripts) form an important means of dialogue between the various stakeholders of the engineering community. Analysis of textual data demands consumption of a lot of time and resources. As a result, researchers end up spending a lot of time and effort in analyzing such text repositories. While there is a lot to be gained through in-depth research analysis of text data, some educators or administrators could benefit from an automated system which could reveal trends and present broader overviews for given datasets in more time and resource efficient ways. Analyzing datasets using Natural Language Processing is one solution to this problem. The purpose of my doctoral research was two-pronged: first, to describe the current state of use of Natural Language Processing as it applies to the broader field of Education, and second, to demonstrate the use of Natural Language Processing techniques for two Engineering Education specific contexts of instruction and research respectively. Specifically, my research includes three manuscripts: (1) systematic review of existing publications on the use of Natural Language Processing in education research, (2) automated classification system for open-ended student responses to gauge metacognition levels in engineering classrooms, and (3) using insights from Natural Language Processing techniques to facilitate exploratory analysis of a large interview dataset led by a novice researcher. A common theme across the three tasks was to explore the use of Natural Language Processing techniques to enable the computer to extract meaningful information from textual data for Engineering Education related contexts. 
Results from my first manuscript suggested that researchers in the broader field of Education used Natural Language Processing for a wide range of tasks, primarily serving to automate instruction in terms of creating content for examinations, automated grading, or intelligent tutoring purposes. In manuscripts two and three, I implemented some of the Natural Language Processing techniques, such as part-of-speech tagging and tf-idf (term frequency-inverse document frequency), that my systematic review found to be used by researchers, to (a) develop an automated classification system for student responses to gauge their metacognitive levels and (b) conduct an exploratory, novice-led analysis of excerpts from interviews of students on career preparedness, respectively. Overall, the results of my research studies indicate that although the use of Natural Language Processing techniques in Engineering Education is not yet widespread, such research endeavors could facilitate research and practice in our field. In particular, this type of approach to textual data could be of use to practitioners in large engineering classrooms who are unable to devote large amounts of time to data analysis but would benefit from algorithmic systems that could quickly present a summary based on information processed from available text data.
Ph. D.
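The tf-idf weighting mentioned above has a standard textbook form: term frequency within a document, scaled down by how many documents the term appears in. A minimal unsmoothed sketch (real toolkits usually add smoothing to the idf term and normalise the resulting vectors):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, using the unsmoothed textbook formulation."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

A term occurring in every document gets idf log(1) = 0 and drops out, which is exactly the behaviour that makes tf-idf useful for surfacing the distinctive vocabulary of a student response or interview excerpt.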
APA, Harvard, Vancouver, ISO, and other styles
13

Norsten, Theodor. "Exploring the Potential of Twitter Data and Natural Language Processing Techniques to Understand the Usage of Parks in Stockholm." Thesis, KTH, Geoinformatik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-278532.

Full text
Abstract:
Traditional methods used to investigate the usage of parks consist of questionnaires, which is both a very time- and resource-consuming approach. Today, more than four billion people use some form of social media platform daily. This has led to huge amounts of data being generated every day through various social media platforms and has created a potential new source for retrieving large amounts of data. This report investigates a modern approach, using Natural Language Processing on Twitter data, to understand how parks in Stockholm are being used. Natural Language Processing (NLP) is an area within artificial intelligence that refers to the process of reading, analyzing, and understanding large amounts of text data, and it is considered to be the future of understanding unstructured text. Twitter data were obtained through Twitter's open API. Data from three parks in Stockholm were collected for the period 2015-2019. Three analyses were then performed: temporal, sentiment, and topic modeling. The results from these analyses show that it is possible to understand what attitudes and activities are associated with visiting parks using NLP on social media data. It is clear that sentiment analysis is a difficult task for computers to solve, and it is still at an early stage of development. The results from the sentiment analysis indicate some uncertainties. To achieve more reliable results, the analysis would need to consist of much more data, use more thorough cleaning methods, and be based on English tweets. One significant conclusion from the results is that people's attitudes and activities linked to each park are clearly correlated with the different attributes each park consists of. Another clear pattern is that the usage of parks peaks significantly during holiday celebrations, and positive sentiment is the emotion most strongly linked to park visits.
Findings suggest that future studies focus on combining the approach in this report with geospatial data from a social media platform where users share their geolocation to a greater extent.
Traditional methods used to understand how people use parks consist of questionnaires, a very time- and resource-consuming method. Today, more than four billion people use some form of social media platform daily. This means that enormous amounts of data are generated every day via various social media platforms, creating the potential for a new source of large amounts of data. This report examines a modern approach, using Natural Language Processing on Twitter data, to understand how parks in Stockholm are used. Natural Language Processing (NLP) is a field within artificial intelligence and refers to the process of reading, analyzing and understanding large amounts of text data; it is considered the future of understanding unstructured text. Data from Twitter were obtained via Twitter's open API. Data from three parks in Stockholm were collected for the period 2015-2019. Three analyses were then carried out: temporal, sentiment and topic modeling. The results of these analyses show that it is possible to understand which attitudes and activities are associated with visiting parks by applying NLP to social media data. It is clear that sentiment analysis is a difficult problem for computers to solve and is still at an early stage of development. The results of the sentiment analysis indicate some uncertainties. To achieve more reliable results, the analysis would need to consist of much more data, use more precise data-cleaning methods, and be based on tweets written in English. One clear conclusion from the results is that people's attitudes and activities linked to each park are clearly correlated with the different attributes of each park. Another clear pattern is that park usage peaks during holidays and that positive sentiment is the emotion most strongly linked to park visits.
The results suggest that future studies focus on combining the method in this report with geospatial data from a social media platform where users share their location to a greater extent.
APA, Harvard, Vancouver, ISO, and other styles
14

Salov, Aleksandar. "Towards automated learning from software development issues : Analyzing open source project repositories using natural language processing and machine learning techniques." Thesis, Linnéuniversitetet, Institutionen för medieteknik (ME), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-66834.

Full text
Abstract:
This thesis presents an in-depth investigation on the subject of how natural language processing and machine learning techniques can be utilized in order to perform a comprehensive analysis of programming issues found in different open source project repositories hosted on GitHub. The research is focused on examining issues gathered from a number of JavaScript repositories based on their user generated textual description. The primary goal of the study is to explore how natural language processing and machine learning methods can facilitate the process of identifying and categorizing distinct issue types. Furthermore, the research goes one step further and investigates how these same techniques can support users in searching for potential solutions to these issues. For this purpose, an initial proof-of-concept implementation is developed, which collects over 30 000 JavaScript issues from over 100 GitHub repositories. Then, the system extracts the titles of the issues, cleans and processes the data, before supplying it to an unsupervised clustering model which tries to uncover any discernible similarities and patterns within the examined dataset. What is more, the main system is supplemented by a dedicated web application prototype, which enables users to utilize the underlying machine learning model in order to find solutions to their programming related issues. Furthermore, the developed implementation is meticulously evaluated through a number of measures. First of all, the trained clustering model is assessed by two independent groups of external reviewers - one group of fellow researchers and another group of practitioners in the software industry, so as to determine whether the resulting categories contain distinct types of issues. 
Moreover, in order to find out if the system can facilitate the search for issue solutions, the web application prototype is tested in a series of user sessions with participants who are not only representative of the main target group which can benefit most from such a system, but who also have a mixture of both practical and theoretical backgrounds. The results of this research demonstrate that the proposed solution can effectively categorize issues according to their type, solely based on the user generated free-text title. This provides strong evidence that natural language processing and machine learning techniques can be utilized for analyzing issues and automating the overall learning process. However, the study was unable to conclusively determine whether these same methods can aid the search for issue solutions. Nevertheless, the thesis provides a detailed account of how this problem was addressed and can therefore serve as the basis for future research.
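The pipeline described above (collect issue titles, clean them, vectorize them, and group them with an unsupervised model) can be sketched in miniature. The snippet below is an illustrative stand-in, not the thesis's implementation: it uses plain TF-IDF weighting and a greedy single-pass grouping instead of the clustering model evaluated in the study, and the titles and similarity threshold are invented examples.

```python
import math
import re
from collections import Counter

def tfidf_vectors(titles):
    """Turn issue titles into TF-IDF weighted bag-of-words vectors."""
    docs = [re.findall(r"[a-z]+", t.lower()) for t in titles]
    df = Counter(w for d in docs for w in set(d))   # document frequency
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(titles, threshold=0.2):
    """Greedy single-pass clustering: attach each title to the first
    existing cluster whose first member is similar enough, else open
    a new cluster. Order-dependent, unlike a proper clustering model."""
    vecs = tfidf_vectors(titles)
    clusters = []  # list of lists of title indices
    for i, v in enumerate(vecs):
        for c in clusters:
            if cosine(v, vecs[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Run on four toy titles, the two crash reports end up in one cluster and the two documentation issues in another.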
APA, Harvard, Vancouver, ISO, and other styles
15

Olin, Per. "Evaluation of text classification techniques for log file classification." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166641.

Full text
Abstract:
System log files are filled with logged events, status codes, and other messages. By analyzing the log files, the system's current state can be determined and it can be found out whether something went wrong during execution. Log file analysis has been studied for some time, and recent studies have shown state-of-the-art performance using machine learning techniques. In this thesis, document classification solutions were tested on log files in order to distinguish regular system runs from abnormal system runs. To solve this task, supervised and unsupervised learning methods were combined: Doc2Vec was used to extract document features, and Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) based architectures were applied to the classification task. With these machine learning models and preprocessing techniques, the tested models yielded an F1-score and accuracy above 95% when classifying log files.
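The thesis combines Doc2Vec document vectors with CNN/LSTM classifiers. As a deliberately simplified stand-in for that pipeline, the sketch below classifies a log as a normal or abnormal run using normalized term-frequency vectors and nearest-centroid distance; all log lines are invented examples.

```python
import math
from collections import Counter

def doc_vector(log_text, vocab):
    """Simplified stand-in for Doc2Vec: a normalized term-count vector."""
    counts = Counter(log_text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train(normal_logs, abnormal_logs):
    """Build a vocabulary and one centroid per class from labeled logs."""
    vocab = sorted({w for log in normal_logs + abnormal_logs for w in log.lower().split()})
    normal_c = centroid([doc_vector(log, vocab) for log in normal_logs])
    abnormal_c = centroid([doc_vector(log, vocab) for log in abnormal_logs])
    return vocab, normal_c, abnormal_c

def classify(log_text, vocab, normal_c, abnormal_c):
    """Assign the new log to the nearest class centroid."""
    v = doc_vector(log_text, vocab)
    return "normal" if euclid(v, normal_c) <= euclid(v, abnormal_c) else "abnormal"
```

A real system would replace the count vectors with learned Doc2Vec embeddings and the centroid rule with a trained CNN or LSTM classifier.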
APA, Harvard, Vancouver, ISO, and other styles
16

Silveira, Fausto Magalhães da. "Terminologia e tradução na localização de software : insumos para o processamento da linguagem natural." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/79460.

Full text
Abstract:
This paper focuses on the process of Quality Assurance (QA) that is undertaken by the Localization industry, aiming at improving the work of translators. Localization consists of a process and a professional field whose purpose is to adapt goods or services (usually software-related) according to the language and cultural conventions of a particular locale in order to facilitate market penetration in a given country or market. One of the QA stages consists of validating the terminology on a translation project. The QA for terminology makes use of software to check if the applicable terminology is used in translation. Occurrences that the software identifies as incorrect are saved in a list for terminology validation. The list is usually reviewed by a translator or an editor. The items considered incorrect by the translator are corrected in the translation, and the remaining entries are discarded. Because the software does not take language aspects into account, a good deal of noise is generated, resulting in large lists that are not cost-effective or time-efficient to review. With the purpose of providing input to solve the problem, this work employs a communicative, cognitive and functional approach to terminology and translation for the analysis of a terminology validation list in U.S. English and Brazilian Portuguese, on a genuine localization project. To complete this task, a list for validation was generated via a well-known QA software product used in the Localization field. Occurrences from the generated list were analyzed and categorized according to phraseological, variational and translational criteria in addition to morphological and discursive criteria. The objective is to provide input to drive the development of linguistically motivated computer applications that may reduce the incidence of noise on the lists. 
Results show that most of the noise is due to general linguistic factors, such as morphological and discourse aspects, also suggesting that 1/3 of that noise occurs simultaneously with phraseological, variational and translational phenomena.
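The terminology QA check described above (software flags target segments that do not use the approved term) can be illustrated with a minimal sketch. The term base and segment pairs below are invented placeholders; a real QA tool processes full bilingual files against much larger term bases. Note how naive string matching also reproduces the "noise" problem the study analyzes: an inflected or varied form of the approved term would still be flagged as an error.

```python
import re

# Illustrative term base: English source term -> approved Brazilian
# Portuguese translation (all entries are invented examples).
TERMBASE = {
    "file": "arquivo",
    "folder": "pasta",
    "settings": "configurações",
}

def validate_terminology(segments):
    """segments: list of (source, target) sentence pairs.
    Returns a terminology validation list: candidate violations where
    a source term occurs but its approved translation does not."""
    flags = []
    for source, target in segments:
        src, tgt = source.lower(), target.lower()
        for term, approved in TERMBASE.items():
            if re.search(rf"\b{term}\b", src) and approved not in tgt:
                flags.append((term, approved, source))
    return flags
```

Here the translator chose "diretório" instead of the approved "pasta", so that segment lands on the validation list while the compliant one does not.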
APA, Harvard, Vancouver, ISO, and other styles
17

Tovedal, Sofiea. "On The Effectiveness of Multi-Task Learning : An evaluation of Multi-Task Learning techniques in deep learning models." Thesis, Umeå universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-172257.

Full text
Abstract:
Multi-Task Learning is today an interesting and promising field which many mention as a must for achieving the next level of advancement within machine learning. However, in reality, Multi-Task Learning is used much more rarely in real-world implementations than its more popular cousin, Transfer Learning. The question is why that is, and whether Multi-Task Learning outperforms its Single-Task counterparts. In this thesis, different Multi-Task Learning architectures were utilized in order to build a model that can label real technical issues within two categories. The model faces a challenging imbalanced data set with many labels to choose from and short texts to base its predictions on. Can task-sharing be the answer to these problems? This thesis investigated three Multi-Task Learning architectures and compared their performance to a Single-Task model. An authentic data set and two labeling tasks were used in training the models with the method of supervised learning. The four model architectures (Single-Task, Multi-Task, Cross-Stitched and Shared-Private) first went through a hyperparameter tuning process using one of the two layer options, LSTM and GRU. They were then boosted by auxiliary tasks and finally evaluated against each other.
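Hard parameter sharing, the basic Multi-Task Learning layout among the architectures compared in the thesis, can be sketched as one shared encoder feeding one output head per task. The NumPy snippet below shows only the forward pass with invented layer sizes; training, the Cross-Stitched and Shared-Private variants, and the LSTM/GRU layers are omitted.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_mtl(n_features, n_hidden, task_sizes, seed=0):
    """Hard parameter sharing: one shared weight matrix, one linear head per task."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(0.0, 0.1, (n_features, n_hidden))
    heads = [rng.normal(0.0, 0.1, (n_hidden, k)) for k in task_sizes]
    return shared, heads

def forward(x, shared, heads):
    h = np.tanh(x @ shared)                  # shared representation for all tasks
    return [softmax(h @ W) for W in heads]   # one probability distribution per task
```

Because every task reads the same hidden representation `h`, gradients from all task losses would update the shared weights jointly, which is exactly the inductive bias task-sharing is meant to provide.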
APA, Harvard, Vancouver, ISO, and other styles
18

Andreani, Vanessa. "Immersion dans des documents scientifiques et techniques : unités, modèles théoriques et processus." Phd thesis, Université de Grenoble, 2011. http://tel.archives-ouvertes.fr/tel-00662668.

Full text
Abstract:
This thesis addresses the problem of accessing the scientific and technical information conveyed by large document collections. To allow users to find the information relevant to them, we worked on defining a model that meets the flexibility requirements of our industrial application context; to this end, we postulate the need to segment the information drawn from documents into ontological planes. The resulting model enables documentary immersion through the combination of three complementary types of processes: endogenous processes (exploiting the corpus to analyze the corpus), exogenous processes (calling on external resources) and anthropogenic processes (in which the user's skills are treated as a resource). All of them contribute to giving the user a central place in the system, as an agent who interprets information and constructs his or her own knowledge, once placed in an industrial or specialized context.
APA, Harvard, Vancouver, ISO, and other styles
19

Marzinotto, Gabriel. "Semantic frame based analysis using machine learning techniques : improving the cross-domain generalization of semantic parsers." Electronic Thesis or Diss., Aix-Marseille, 2019. http://www.theses.fr/2019AIXM0483.

Full text
Abstract:
Making semantic parsers robust to lexical and stylistic variations is a real challenge with many industrial applications. Nowadays, semantic parsing requires the usage of domain-specific training corpora to ensure acceptable performances on a given domain. Transfer learning techniques are widely studied and adopted when addressing this lack of robustness, and the most common strategy is the usage of pre-trained word representations. However, the best parsers still show significant performance degradation under domain shift, evidencing the need for supplementary transfer learning strategies to achieve robustness. This work proposes a new benchmark to study the domain dependence problem in semantic parsing. We use this benchmark to evaluate classical transfer learning techniques and to propose and evaluate new techniques based on adversarial learning. All these techniques are tested on state-of-the-art semantic parsers. We claim that adversarial learning approaches can improve the generalization capacities of models. We test this hypothesis on different semantic representation schemes, languages and corpora, providing experimental results to support our hypothesis.
APA, Harvard, Vancouver, ISO, and other styles
20

Hou, Tianjun. "L’analyse des commentaires de client : Comment obtenir les informations utiles pour l’innovation et l’amélioration de produit." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLC095/document.

Full text
Abstract:
With the development of e-commerce, consumers have posted a large number of online reviews on the internet. These user-generated data are valuable for product designers, as information concerning user requirements and preferences can be identified. The objective of this study is to develop an approach to guide product design by automatically analyzing online reviews. The proposed approach consists of two steps: data structuration and data analytics. In data structuration, the author first proposes an ontological model to organize the words and expressions concerning user requirements in review text. Then, a rule-based natural language processing method is proposed to automatically structure review text into the proposed ontology. In data analytics, two methods are proposed based on the structured review data to provide designers with ideas on innovation and to draw insights on the changes of user preference over time. In these two methods, traditional affordance-based design, conjoint analysis and the Kano model are studied and innovatively applied in the context of big data. To evaluate the practicability of the proposed approach, the online reviews of Kindle e-readers are downloaded and analyzed, based on which the innovation path and the strategies for product improvement are identified and constructed.
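The rule-based structuring step (mapping words and expressions in review text onto categories of user requirements) can be illustrated with one toy pattern rule: an opinion word immediately preceding a product-feature noun. The opinion and feature lexicons below are invented placeholders; real systems use part-of-speech patterns and far richer linguistic resources.

```python
import re

# Invented mini-lexicons for illustration only.
OPINIONS = {"great": "+", "excellent": "+", "poor": "-", "terrible": "-"}
FEATURES = {"battery", "screen", "light", "price"}

def extract_opinions(review):
    """Return (feature, polarity) pairs matched by the toy rule
    'opinion word directly before a feature noun'."""
    pairs = []
    tokens = re.findall(r"[a-z]+", review.lower())
    for prev, word in zip(tokens, tokens[1:]):
        if prev in OPINIONS and word in FEATURES:
            pairs.append((word, OPINIONS[prev]))
    return pairs
```

On "Great battery life but a terrible screen." this yields a positive mention of the battery and a negative one of the screen, the kind of structured record the proposed ontology would then organize.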
APA, Harvard, Vancouver, ISO, and other styles
21

Bustos, Aurelia. "Extraction of medical knowledge from clinical reports and chest x-rays using machine learning techniques." Doctoral thesis, Universidad de Alicante, 2019. http://hdl.handle.net/10045/102193.

Full text
Abstract:
This thesis addresses the extraction of medical knowledge from clinical text using deep learning techniques. In particular, the proposed methods focus on cancer clinical trial protocols and chest x-ray reports. The main results are a proof of concept of the capability of machine learning methods to discern which are regarded as inclusion or exclusion criteria in short free-text clinical notes, and a large-scale chest x-ray image dataset labeled with radiological findings, diagnoses and anatomic locations. Clinical trials provide the evidence needed to determine the safety and effectiveness of new medical treatments. These trials are the basis employed for clinical practice guidelines and greatly assist clinicians in their daily practice when making decisions regarding treatment. However, the eligibility criteria used in oncology trials are too restrictive. Patients are often excluded on the basis of comorbidity, past or concomitant treatments and the fact they are over a certain age, and those patients that are selected do not, therefore, mimic clinical practice. This signifies that the results obtained in clinical trials cannot be extrapolated to patients if their clinical profiles were excluded from the clinical trial protocols. The efficacy and safety of new treatments for patients with these characteristics are not, therefore, defined. Given the clinical characteristics of particular patients, their type of cancer and the intended treatment, discovering whether or not they are represented in the corpus of available clinical trials requires the manual review of numerous eligibility criteria, which is impracticable for clinicians on a daily basis. In this thesis, a large medical corpus comprising all cancer clinical trial protocols published by competent authorities over the last 18 years was used to extract medical knowledge in order to help automatically learn patients' eligibility in these trials. 
For this, a model is built to automatically predict whether short clinical statements were considered inclusion or exclusion criteria. A method based on deep neural networks is trained on a dataset of 6 million short free-texts to classify them as eligible or not eligible. For this, pretrained word embeddings were used as inputs in order to predict whether or not short free-text statements describing clinical information were considered eligible. The semantic reasoning of the word-embedding representations obtained was also analyzed, making it possible to identify equivalent treatments for a type of tumor by analogy with the drugs used to treat other tumors. Results show that representation learning using deep neural networks can be successfully leveraged to extract the medical knowledge from clinical trial protocols and potentially assist practitioners when prescribing treatments. The second main task addressed in this thesis is related to knowledge extraction from medical reports associated with radiographs. Conventional radiology remains the most performed technique in radiodiagnosis services, with a percentage close to 75% (Radiología Médica, 2010). In particular, chest x-ray is the most common medical imaging exam, with over 35 million taken every year in the US alone (Kamel et al., 2017). They allow for inexpensive screening of several pathologies including masses, pulmonary nodules, effusions, cardiac abnormalities and pneumothorax. For this task, all the chest x-rays that had been interpreted and reported by radiologists at the Hospital Universitario de San Juan (Alicante) from Jan 2009 to Dec 2017 were used to build a novel large-scale dataset in which each high-resolution radiograph is labeled with its corresponding metadata, radiological findings and pathologies. 
This dataset, named PadChest, includes more than 160,000 images obtained from 67,000 patients, covering six different position views and additional information on image acquisition and patient demography. The free-text reports written in Spanish by radiologists were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. For this, a subset of the reports (27%) was manually annotated by trained physicians, whereas the remaining set was automatically labeled with deep supervised learning methods using attention mechanisms and fed with the text reports. The labels generated were then validated in an independent test set, achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is one of the largest public chest x-ray databases suitable for training supervised models concerning radiographs, and also the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded on request from http://bimcv.cipf.es/bimcv-projects/padchest/. PadChest is intended for training image classifiers based on deep learning techniques to extract medical knowledge from chest x-rays. It is essential that automatic radiology reporting methods can be integrated in a clinically validated manner into radiologists' workflow in order to help specialists improve their efficiency and enable safer and actionable reporting. Computer vision methods capable of identifying both the large spectrum of thoracic abnormalities (and also normality) need to be trained on comprehensively labeled large-scale x-ray datasets such as PadChest. The development of these computer vision tools, once clinically validated, could serve to fulfill a broad range of unmet needs. 
Beyond implementing and obtaining results for both clinical trials and chest x-rays, this thesis studies the nature of the health data, the novelty of applying deep learning methods to obtain large-scale labeled medical datasets, and the relevance of its applications in medical research, which have contributed to its extramural diffusion and worldwide reach. This thesis describes this journey so that the reader is navigated across multiple disciplines, from engineering to medicine up to ethical considerations in artificial intelligence applied to medicine.
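The inclusion/exclusion criterion classifier described above is, at its core, binary text classification. As a much simplified stand-in for the deep network over pretrained embeddings used in the thesis, the sketch below trains a perceptron on bag-of-words features; all criterion strings are invented examples, not drawn from real trial protocols.

```python
import re
from collections import defaultdict

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train_perceptron(examples, epochs=10):
    """examples: list of (text, label) pairs with label +1 for
    exclusion criteria and -1 for inclusion criteria."""
    w = defaultdict(float)
    for _ in range(epochs):
        for text, y in examples:
            score = sum(w[t] for t in tokens(text))
            if y * score <= 0:              # misclassified: shift weights toward y
                for t in tokens(text):
                    w[t] += y
    return w

def predict(w, text):
    return "exclusion" if sum(w[t] for t in tokens(text)) > 0 else "inclusion"
```

The thesis's model replaces these sparse word counts with pretrained embeddings and the linear rule with a deep network, which is what allows it to generalize across 6 million statements.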
APA, Harvard, Vancouver, ISO, and other styles
22

Chanier, Thierry. "Compréhension de textes dans un domaine technique : le système Actes ; application des grammaires d'unification et de la théorie du discours." Paris 13, 1989. http://www.theses.fr/1989PA132015.

Full text
Abstract:
The Actes system offers, on the one hand, a formalism for defining grammars and building an intermediate semantic representation of texts and, on the other hand, an environment for grammar development and text analysis.
APA, Harvard, Vancouver, ISO, and other styles
23

Soriano-Morales, Edmundo-Pavel. "Hypergraphs and information fusion for term representation enrichment : applications to named entity recognition and word sense disambiguation." Thesis, Lyon, 2018. http://www.theses.fr/2018LYSE2009/document.

Full text
Abstract:
Making sense of textual data is an essential requirement in order to make computers understand our language. To extract actionable information from text, we need to represent it by means of descriptors before using knowledge discovery techniques.The goal of this thesis is to shed light into heterogeneous representations of words and how to leverage them while addressing their implicit sparse nature.First, we propose a hypergraph network model that holds heterogeneous linguistic data in a single unified model. In other words, we introduce a model that represents words by means of different linguistic properties and links them together accordingto said properties. Our proposition differs to other types of linguistic networks in that we aim to provide a general structure that can hold several types of descriptive text features, instead of a single one as in most representations. This representationmay be used to analyze the inherent properties of language from different points of view, or to be the departing point of an applied NLP task pipeline. Secondly, we employ feature fusion techniques to provide a final single enriched representation that exploits the heterogeneous nature of the model and alleviates the sparseness of each representation.These types of techniques are regularly used exclusively to combine multimedia data. In our approach, we consider different text representations as distinct sources of information which can be enriched by themselves. This approach has not been explored before, to the best of our knowledge. Thirdly, we propose an algorithm that exploits the characteristics of the network to identify and group semantically related words by exploiting the real-world properties of the networks. 
In contrast with similar methods that are also based on the structure of the network, our algorithm reduces the number of required parameters and more importantly, allows for the use of either lexical or syntactic networks to discover said groups of words, instead of the singletype of features usually employed.We focus on two different natural language processing tasks: Word Sense Induction and Disambiguation (WSI/WSD), and Named Entity Recognition (NER). In total, we test our propositions on four different open-access datasets. The results obtained allow us to show the pertinence of our contributions and also give us some insights into the properties of heterogeneous features and their combinations with fusion methods. Specifically, our experiments are twofold: first, we show that using fusion-enriched heterogeneous features, coming from our proposed linguistic network, we outperform the performance of single features’ systems and other basic baselines. We note that using single fusion operators is not efficient compared to using a combination of them in order to obtain a final space representation. We show that the features added by each combined fusion operation are important towards the models predicting the appropriate classes. We test the enriched representations on both WSI/WSD and NER tasks. Secondly, we address the WSI/WSD task with our network-based proposed method. While based on previous work, we improve it by obtaining better overall performance and reducing the number of parameters needed. We also discuss the use of either lexical or syntactic networks to solve the task.Finally, we parse a corpus based on the English Wikipedia and then store it following the proposed network model. The parsed Wikipedia version serves as a linguistic resource to be used by other researchers. 
Contrary to other similar resources, instead of storing just the part-of-speech tag and dependency relations of each word analyzed, we also take into account its constituency-tree information. The hope is for this resource to be used in future developments without the need to compile such a resource from scratch.
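As an editorial illustration of the feature-fusion idea this abstract describes (not the author's implementation), a minimal early-fusion sketch might concatenate a word's lexical and syntactic feature vectors into one enriched representation; the vectors below are made-up stand-ins:

```python
import numpy as np

def early_fusion(representations):
    """Concatenate heterogeneous feature vectors into one enriched vector."""
    return np.concatenate([np.asarray(r, dtype=float) for r in representations])

# Hypothetical sparse feature vectors for the same word
lexical = [0.0, 1.0, 0.0, 2.0]   # e.g. co-occurrence counts
syntactic = [1.0, 0.0, 3.0]      # e.g. dependency-relation counts

fused = early_fusion([lexical, syntactic])
print(fused.shape)  # (7,)
```

Real fusion operators can be richer (weighted sums, products, learned combinations); concatenation is only the simplest instance of combining distinct sources of information.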
APA, Harvard, Vancouver, ISO, and other styles
24

Matsubara, Shigeki. "Corpus-based Natural Language Processing." INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2004. http://hdl.handle.net/2237/10355.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Smith, Sydney. "Approaches to Natural Language Processing." Scholarship @ Claremont, 2018. http://scholarship.claremont.edu/cmc_theses/1817.

Full text
Abstract:
This paper explores topic modeling through the example text of Alice in Wonderland. It explores both singular value decomposition as well as non-negative matrix factorization as methods for feature extraction. The paper goes on to explore methods for partially supervised implementation of topic modeling through introducing themes. A large portion of the paper also focuses on implementation of these techniques in Python as well as visualizations of the results which use a combination of Python, HTML and JavaScript along with the D3 framework. The paper concludes by presenting a mixture of SVD, NMF and partially-supervised NMF as a possible way to improve topic modeling.
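To illustrate the SVD-based feature extraction this abstract mentions (a generic sketch, not the paper's code), one can factor a tiny term-document count matrix and read the columns of U as latent "topic" directions over the vocabulary; the toy matrix and vocabulary are assumptions:

```python
import numpy as np

# Tiny hypothetical term-document count matrix (rows = terms, cols = documents)
terms = ["alice", "rabbit", "queen", "tea", "party"]
X = np.array([
    [3, 1, 0],   # alice
    [2, 0, 0],   # rabbit
    [0, 2, 1],   # queen
    [0, 1, 3],   # tea
    [0, 0, 2],   # party
], dtype=float)

# Thin SVD: columns of U are latent directions over the vocabulary
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def top_terms(component, k=2):
    """Terms with the largest absolute weight on a latent component."""
    idx = np.argsort(-np.abs(U[:, component]))[:k]
    return [terms[i] for i in idx]

print(top_terms(0))  # strongest terms on the first latent dimension
```

Truncating to the top few singular vectors gives the low-rank document features used for topic modeling; NMF plays the analogous role with a non-negativity constraint.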
APA, Harvard, Vancouver, ISO, and other styles
26

Effa, Bella Emma. "Apports des techniques d’apprentissage semi-supervisées dans l’établissement de liens entre artefacts de conception." Electronic Thesis or Diss., Sorbonne université, 2019. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2019SORUS093.pdf.

Full text
Abstract:
Dans un environnement collaboratif de développement de systèmes complexes, plusieurs entreprises doivent échanger un nombre important de modèles hétérogènes et d’exigences. Durant les phases du cycle de vie du système, ces artefacts, reliés les uns aux autres et issus de différents outils de modélisations, évoluent constamment. Dans un tel environnement, il est crucial de gérer l’impact des différents changements se produisant dans les différents espaces de conception. La traçabilité répond à ce besoin. Toutefois, établir des liens entre des exigences et des modèles en ingénierie des systèmes complexes suppose de faire face à une volumétrie importante des artefacts. Par exemple, pour une spécification d’un véhicule autonome comprenant 3 000 exigences et 400 éléments de modèles, il faudrait en théorie vérifier de l’ordre d’un million de liens potentiels. Bien que plusieurs approches aient été proposées pour l’identification des liens de traçabilité, le processus de validation des liens est toujours chronophage et générateur d’erreurs. Dans cette thèse, nous proposons une approche semi supervisée qui permet d’apprendre via un modèle probabiliste à reconnaître des liens de traçabilité valides ou non valides à partir de mesures syntaxiques et sémantiques. Cette approche fournit ainsi une mesure quantitative de confiance sur chaque lien candidat. Cette dernière permet potentiellement à l’expert en phase de validation d’optimiser son effort de vérification des liens tout en maîtrisant les risques d’erreur
During the development of complex systems, several enterprises exchange a large number of heterogeneous models and requirements. During the phases of the system's life cycle, these artifacts, linked to each other and derived from different modelling tools, are constantly evolving. In such an environment, it is necessary to manage the impact of the different changes occurring in the different design spaces. Traceability meets this need. However, establishing links between requirements and models in complex systems engineering requires dealing with a large volume of artifacts. For example, for a specification of an autonomous vehicle with 3,000 requirements and 400 model elements, it would theoretically be necessary to check about one million potential links. Although several approaches have been proposed for identifying traceability links, the validation process is always time-consuming and error-prone. This is mainly due to the predominance of manual operations during this process. In this thesis, we propose a semi-supervised approach that learns, through a probabilistic model, to recognize valid and invalid links from similarity measures and scores. This approach provides a quantitative confidence measure on each candidate link. This measure allows the expert in the validation phase to optimize his verification effort while reducing the risks of error. The evaluation results show that our approach outperforms state-of-the-art traceability methods: in industrial cases, we obtain a reduction of invalid links (false positives) of about 80% compared to state-of-the-art methods, while keeping up to 75% of the valid links (true positives)
APA, Harvard, Vancouver, ISO, and other styles
27

Strandberg, Aron, and Patrik Karlström. "Processing Natural Language for the Spotify API : Are sophisticated natural language processing algorithms necessary when processing language in a limited scope?" Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186867.

Full text
Abstract:
Knowing whether you can implement something complex in a simple way in your application is always of interest. A natural language interface is something that could theoretically be implemented in a lot of applications, but the complexity of most natural language processing algorithms is a limiting factor. The problem explored in this paper is whether a simpler algorithm that doesn't make use of convoluted statistical models and machine learning can be good enough. We implemented two algorithms, one utilizing Spotify's own search and one with a more accurate, offline search. With the best precision we could muster being 81% at an average of 2.28 seconds per query, this is not a viable solution for a complete and satisfactory user experience. Further work could push the performance into an acceptable range.
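A simpler algorithm of the kind this thesis investigates could be sketched as a rule-based intent parser; the patterns and intent names below are illustrative assumptions, not the authors' implementation:

```python
import re

# Minimal rule-based parser for music queries, e.g. "play yesterday by the beatles".
# Patterns are tried in order, most specific first.
PATTERNS = [
    (re.compile(r"^play (?P<track>.+) by (?P<artist>.+)$"), "play_track"),
    (re.compile(r"^play (?P<query>.+)$"), "play_query"),
]

def parse(utterance):
    text = utterance.lower().strip()
    for pattern, intent in PATTERNS:
        m = pattern.match(text)
        if m:
            return {"intent": intent, **m.groupdict()}
    return {"intent": "unknown"}

print(parse("Play Yesterday by The Beatles"))
# {'intent': 'play_track', 'track': 'yesterday', 'artist': 'the beatles'}
```

The extracted slots would then be passed to a search endpoint; the trade-off the thesis measures is precisely how far such shallow pattern matching can go before statistical methods become necessary.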
APA, Harvard, Vancouver, ISO, and other styles
28

Chen, Joseph C. H. "Quantum computation and natural language processing." [S.l.] : [s.n.], 2002. http://deposit.ddb.de/cgi-bin/dokserv?idn=965581020.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Knight, Sylvia Frances. "Natural language processing for aerospace documentation." Thesis, University of Cambridge, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.621395.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Naphtal, Rachael (Rachael M. ). "Natural language processing based nutritional application." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/100640.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 67-68).
The ability to accurately and efficiently track nutritional intake is a powerful tool in combating obesity and other food-related diseases. Currently, many methods used for this task are time consuming or easily abandoned; however, a natural language based application that converts spoken text to nutritional information could be a convenient and effective solution. This thesis describes the creation of an application that translates spoken food diaries into nutritional database entries. It explores different methods for solving the problem of converting brands, descriptions and food item names into entries in nutritional databases. Specifically, we constructed a cache of over 4,000 food items, and also created a variety of methods to allow refinement of database mappings. We also explored methods of dealing with ambiguous quantity descriptions and the mapping of spoken quantity values to numerical units. When assessed by 500 users entering their daily meals on Amazon Mechanical Turk, the system was able to map 83.8% of the correctly interpreted spoken food items to relevant nutritional database entries. It was also able to find a logical quantity for 92.2% of the correct food entries. Overall, this system shows a significant step towards the intelligent conversion of spoken food diaries to actual nutritional feedback.
by Rachael Naphtal.
M. Eng.
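The mapping of spoken quantity descriptions to numerical units mentioned in the abstract above might, in a much simplified form, look like the following sketch; the word-to-value table is a made-up stand-in, not the thesis's actual method:

```python
# Hypothetical lookup table for spoken quantity phrases
QUANTITY_WORDS = {
    "a": 1.0, "an": 1.0, "one": 1.0, "two": 2.0, "three": 3.0,
    "half": 0.5, "half a": 0.5, "a half": 0.5, "a quarter": 0.25,
}

def parse_quantity(phrase):
    """Map a spoken quantity phrase to a number, or None if ambiguous."""
    phrase = phrase.lower().strip()
    if phrase in QUANTITY_WORDS:
        return QUANTITY_WORDS[phrase]
    try:
        return float(phrase)   # e.g. "2" or "1.5"
    except ValueError:
        return None            # ambiguous: fall back to a default serving size

print(parse_quantity("half a"))   # 0.5
print(parse_quantity("2"))        # 2.0
print(parse_quantity("a few"))    # None
```

Returning None for unrecognized phrases is one way to model the "ambiguous quantity descriptions" the abstract mentions: the system can then apply a default serving size or ask the user.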
APA, Harvard, Vancouver, ISO, and other styles
31

Bowden, T. G. "Natural language techniques for error correction." Thesis, University of Cambridge, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.596815.

Full text
Abstract:
Dealing with human errors such as spelling or grammar mistakes is a necessary part of natural language processing. The aim of this project was to investigate how far error detection and correction could proceed when the system purview was set to a sub-sentential stretch of text. This restriction comes from cooperative error handling: detecting/correcting errors just after user entry, as the user is entering further text. Short context, or shallow, processing is also interesting because it is potentially cheaper and faster than a full-scale parse and because sentential constraints become less reliable when the 'sentence' is ill-formed. There has been no previous report on the effectiveness of local syntactic constraints on general (English) ill-formedness. Additionally, all error-processing programmes, other than some working in very restricted domains, have been post-processors rather than cooperative. Being post-processors, previous programs have been concerned with errors left undetected after some degree of proofreading. Cooperative processing is also aimed at the errors people spend time backtracking to catch. In the absence of existent suitable data, a corpus of keystrokes made by subjects entering a piece of text was collated; errors were classified as caught or uncaught and various interesting analyses emerged. For context-less processing, a method based on morphological error rules and another on binary positional trigrams were devised and compared. Then, to incorporate context, local syntactic constraints based on tag information were implemented, using bigram and trigram co-occurrence checks with a Markov tagging procedure. The tag-based constraints were compared with a partial parsing method. These error handlers were evaluated on data from the Keystroke Corpus and on other data manufactured and collected. The morphological error rules and tag-based checks using very short context were the most promising. 
As far as current comparison allows, there being a scarcity of reported results in this area, the short context techniques implemented here compared well against full-parsing error handlers. Ideas outlined for future work include a method for further identifying detected word scope errors and a practical, usable cooperative corrector based on an extension of an existing commercial application.
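The binary positional trigram method named above can be illustrated with a toy sketch: record which letter trigrams occur at which positions across a lexicon, then flag words containing an unattested (position, trigram) pair. The tiny lexicon is an assumption standing in for a full dictionary, and the thesis's exact formulation may differ:

```python
def positional_trigrams(word):
    """Yield (position, trigram) pairs, padding word boundaries with '#'."""
    padded = f"##{word}##"
    for i in range(len(padded) - 2):
        yield i, padded[i:i + 3]

def build_table(lexicon):
    """Binary table of all (position, trigram) pairs attested in the lexicon."""
    table = set()
    for word in lexicon:
        table.update(positional_trigrams(word))
    return table

def looks_misspelled(word, table):
    """Flag a word if any of its positional trigrams is unattested."""
    return any(pt not in table for pt in positional_trigrams(word))

# Toy lexicon standing in for a full dictionary
table = build_table(["the", "then", "them", "hen", "ten"])
print(looks_misspelled("then", table))   # False: every positional trigram is attested
print(looks_misspelled("tehn", table))   # True: e.g. (2, 'teh') never occurs
```

Because the check needs no sentential context at all, it fits the cooperative, context-less setting the thesis describes.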
APA, Harvard, Vancouver, ISO, and other styles
32

Eriksson, Simon. "COMPARING NATURAL LANGUAGE PROCESSING TO STRUCTURED QUERY LANGUAGE ALGORITHMS." Thesis, Umeå universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163310.

Full text
Abstract:
Using natural language processing to create Structured Query Language (SQL) queries has many benefits in theory. Even though SQL is an expressive and powerful language, it requires certain technical knowledge to use. An interface effectively utilizing natural language processing would instead allow the user to communicate with the SQL database as if they were communicating with another human being. In this paper I compare how two of the currently most advanced open source algorithms in this field (TypeSQL and SyntaxSQL) can understand advanced SQL. I show that SyntaxSQL is significantly more accurate but makes some sacrifices in execution time compared to TypeSQL.
APA, Harvard, Vancouver, ISO, and other styles
33

Kesarwani, Vaibhav. "Automatic Poetry Classification Using Natural Language Processing." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37309.

Full text
Abstract:
Poetry, as a special form of literature, is crucial for computational linguistics. It has a high density of emotions, figures of speech, vividness, creativity, and ambiguity. Poetry poses a much greater challenge for the application of Natural Language Processing algorithms than any other literary genre. Our system establishes a computational model that classifies poems based on similarity features like rhyme, diction, and metaphor. For rhyme analysis, we investigate the methods used to classify poems based on rhyme patterns. First, the overview of different types of rhymes is given, along with a detailed description of detecting rhyme types and sub-types by the application of a pronunciation dictionary on our poetry dataset. We achieve an accuracy of 96.51% in identifying rhymes in poetry by applying a phonetic similarity model. We then derive a rhyme quantification metric, RhymeScore, based on the matching phonetic transcription of each poem. We also develop an application for the visualization of this quantified RhymeScore as a scatter plot in 2 or 3 dimensions. For diction analysis, we investigate the methods used to classify poems based on diction. First, the linguistic quantitative and semantic features that constitute diction are enumerated. Then we investigate the methodology used to compute these features from our poetry dataset. We also build a word embeddings model on our poetry dataset with 1.5 million words in 100 dimensions and do a comparative analysis with GloVe embeddings. Metaphor is a part of diction, but as it is a very complex topic in its own right, we address it as a stand-alone issue and develop several methods for it. Previous work on metaphor detection relies on either rule-based or statistical models, none of them applied to poetry. Our methods focus on metaphor detection in a poetry corpus, but we test on non-poetry data as well. We combine rule-based and statistical models (word embeddings) to develop a new classification system. 
Our first metaphor detection method achieves a precision of 0.759 and a recall of 0.804 in identifying one type of metaphor in poetry, by using a Support Vector Machine classifier with various types of features. Furthermore, our deep learning model based on a Convolutional Neural Network achieves a precision of 0.831 and a recall of 0.836 for the same task. We also develop an application for generic metaphor detection in any type of natural text.
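The pronunciation-dictionary rhyme detection described in the abstract above can be sketched as follows; the mini ARPAbet-style dictionary is a hard-coded assumption standing in for a full resource such as the CMU Pronouncing Dictionary, and the rule (two words rhyme if their phonemes match from the last stressed vowel onward) is one common simplification:

```python
# Toy pronunciation dictionary; '1' marks primary stress on a vowel phoneme.
PRONUNCIATIONS = {
    "bright": ["B", "R", "AY1", "T"],
    "night":  ["N", "AY1", "T"],
    "knight": ["N", "AY1", "T"],
    "bird":   ["B", "ER1", "D"],
}

def rhyme_part(phones):
    """Phonemes from the last stressed vowel to the end of the word."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return phones[i:]
    return phones

def rhymes(w1, w2):
    return rhyme_part(PRONUNCIATIONS[w1]) == rhyme_part(PRONUNCIATIONS[w2])

print(rhymes("bright", "night"))  # True: both end in AY1 T
print(rhymes("bright", "bird"))   # False
```

A graded phonetic-similarity score over these rhyme parts, aggregated over line endings, is the kind of quantity a metric like RhymeScore could be built from.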
APA, Harvard, Vancouver, ISO, and other styles
34

Pham, Son Bao Computer Science &amp Engineering Faculty of Engineering UNSW. "Incremental knowledge acquisition for natural language processing." Awarded by:University of New South Wales. School of Computer Science and Engineering, 2006. http://handle.unsw.edu.au/1959.4/26299.

Full text
Abstract:
Linguistic patterns have been used widely in shallow methods to develop numerous NLP applications. Approaches for acquiring linguistic patterns can be broadly categorised into three groups: supervised learning, unsupervised learning and manual methods. In supervised learning approaches, a large annotated training corpus is required for the learning algorithms to achieve decent results. However, annotated corpora are expensive to obtain and usually available only for established tasks. Unsupervised learning approaches usually start with a few seed examples and gather some statistics based on a large unannotated corpus to detect new examples that are similar to the seed ones. Most of these approaches either populate lexicons for predefined patterns or learn new patterns for extracting general factual information; hence they are applicable to only a limited number of tasks. Manually creating linguistic patterns has the advantage of utilising an expert's knowledge to overcome the scarcity of annotated data. In tasks with no annotated data available, the manual way seems to be the only choice. One typical problem that occurs with manual approaches is that the combination of multiple patterns, possibly being used at different stages of processing, often causes unintended side effects. Existing approaches, however, do not focus on the practical problem of acquiring those patterns but rather on how to use linguistic patterns for processing text. A systematic way to support the process of manually acquiring linguistic patterns in an efficient manner is long overdue. This thesis presents KAFTIE, an incremental knowledge acquisition framework that strongly supports experts in creating linguistic patterns manually for various NLP tasks. 
KAFTIE addresses difficulties in manually constructing knowledge bases of linguistic patterns, or rules in general, often faced in existing approaches by: (1) offering a systematic way to create new patterns while ensuring they are consistent; (2) alleviating the difficulty in choosing the right level of generality when creating a new pattern; (3) suggesting how existing patterns can be modified to improve the knowledge base's performance; (4) making the effort in creating a new pattern, or modifying an existing pattern, independent of the knowledge base's size. KAFTIE, therefore, makes it possible for experts to efficiently build large knowledge bases for complex tasks. This thesis also presents the KAFDIS framework for discourse processing using new representation formalisms: the level-of-detail tree and the discourse structure graph.
APA, Harvard, Vancouver, ISO, and other styles
35

張少能 and Siu-nang Bruce Cheung. "A concise framework of natural language processing." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1989. http://hub.hku.hk/bib/B31208563.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Cahill, Lynne Julie. "Syllable-based morphology for natural language processing." Thesis, University of Sussex, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.386529.

Full text
Abstract:
This thesis addresses the problem of accounting for morphological alternation within Natural Language Processing. It proposes an approach to morphology which is based on phonological concepts, in particular the syllable, in contrast to morpheme-based approaches which have standardly been used by both NLP and linguistics. It is argued that morpheme-based approaches, within both linguistics and NLP, grew out of the apparently purely affixational morphology of European languages, and especially English, but are less appropriate for non-affixational languages such as Arabic. Indeed, it is claimed that even accounts of those European languages miss important linguistic generalizations by ignoring more phonologically based alternations, such as umlaut in German and ablaut in English. To justify this approach, we present a wide range of data from languages as diverse as German and Rotuman. A formal language, MOLUSe, is described, which allows for the definition of declarative mappings between syllable-sequences, and accounts of non-trivial fragments of the inflectional morphology of English, Arabic and Sanskrit are presented, to demonstrate the capabilities of the language. A semantics for the language is defined, and the implementation of an interpreter is described. The thesis discusses theoretical (linguistic) issues, as well as implementational issues involved in the incorporation of MOLUSC into a larger lexicon system. The approach is contrasted with previous work in computational morphology, in particular finite-state morphology, and its relation to other work in the fields of morphology and phonology is also discussed.
APA, Harvard, Vancouver, ISO, and other styles
37

Lei, Tao Ph D. Massachusetts Institute of Technology. "Interpretable neural models for natural language processing." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/108990.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 109-119).
The success of neural network models often comes at a cost of interpretability. This thesis addresses the problem by providing justifications behind the model's structure and predictions. In the first part of this thesis, we present a class of sequence operations for text processing. The proposed component generalizes from convolution operations and gated aggregations. As justifications, we relate this component to string kernels, i.e. functions measuring the similarity between sequences, and demonstrate how it encodes the efficient kernel computing algorithm into its structure. The proposed model achieves state-of-the-art or competitive results compared to alternative architectures (such as LSTMs and CNNs) across several NLP applications. In the second part, we learn rationales behind the model's prediction by extracting input pieces as supporting evidence. Rationales are tailored to be short and coherent, yet sufficient for making the same prediction. Our approach combines two modular components, generator and encoder, which are trained to operate well together. The generator specifies a distribution over text fragments as candidate rationales and these are passed through the encoder for prediction. Rationales are never given during training. Instead, the model is regularized by the desiderata for rationales. We demonstrate the effectiveness of this learning framework in applications such as multi-aspect sentiment analysis. Our method achieves a performance over 90% evaluated against manually annotated rationales.
by Tao Lei.
Ph. D.
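The string kernels that motivate the sequence component in the abstract above measure similarity between sequences; a minimal, non-neural n-gram instance (an editorial sketch, not the thesis's learned component) might look like this:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_kernel(s1, s2, n=2):
    """Unnormalized n-gram string kernel: inner product of n-gram count vectors."""
    a, b = ngrams(s1.split(), n), ngrams(s2.split(), n)
    return sum(a[g] * b[g] for g in a)

def normalized_kernel(s1, s2, n=2):
    k = ngram_kernel(s1, s2, n)
    denom = math.sqrt(ngram_kernel(s1, s1, n) * ngram_kernel(s2, s2, n))
    return k / denom if denom else 0.0

print(normalized_kernel("the movie was great", "the movie was terrible"))  # 2/3
```

The thesis's contribution is, roughly, a differentiable architecture whose internal computation mirrors this kind of kernel evaluation, which is what makes its structure justifiable.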
APA, Harvard, Vancouver, ISO, and other styles
38

Grinman, Alex J. "Natural language processing on encrypted patient data." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/113438.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 85-86).
While many industries can benefit from machine learning techniques for data analysis, they often do not have the technical expertise nor computational power to do so. Therefore, many organizations would benefit from outsourcing their data analysis. Yet, stringent data privacy policies prevent outsourcing sensitive data and may stop the delegation of data analysis in its tracks. In this thesis, we put forth a two-party system where one party capable of powerful computation can run certain machine learning algorithms from the natural language processing domain on the second party's data, where the first party is limited to learning only specific functions of the second party's data and nothing else. Our system provides simple cryptographic schemes for locating keywords, matching approximate regular expressions, and computing frequency analysis on encrypted data. We present a full implementation of this system in the form of an extendible software library and a command line interface. Finally, we discuss a medical case study where we used our system to run a suite of unmodified machine learning algorithms on encrypted free text patient notes.
by Alex J. Grinman.
M. Eng.
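One simple way to realize keyword location on encrypted text, in the spirit of (but not identical to) the schemes the abstract above describes, is deterministic keyword tokens; note that such tokens leak repetition patterns, which practical schemes must mitigate. The key and note are hypothetical:

```python
import hmac
import hashlib

# Secret key held by the data owner (hypothetical)
KEY = b"data-owner-secret-key"

def token(word):
    """Deterministic keyed token for a keyword."""
    return hmac.new(KEY, word.lower().encode(), hashlib.sha256).hexdigest()

# The owner uploads tokens instead of plaintext words...
encrypted_note = [token(w) for w in "patient reports severe headache".split()]

# ...and the untrusted server can locate a queried keyword without seeing the text.
def server_find(encrypted_doc, query_token):
    return [i for i, t in enumerate(encrypted_doc) if t == query_token]

print(server_find(encrypted_note, token("headache")))  # [3]
```

The server learns only which positions match the query token, which is the flavour of "learning only specific functions of the data and nothing else" that the two-party setting aims for.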
APA, Harvard, Vancouver, ISO, and other styles
39

Alharthi, Haifa. "Natural Language Processing for Book Recommender Systems." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/39134.

Full text
Abstract:
The act of reading has benefits for individuals and societies, yet studies show that reading declines, especially among the young. Recommender systems (RSs) can help stop such decline. There is a lot of research regarding literary books using natural language processing (NLP) methods, but the analysis of textual book content to improve recommendations is relatively rare. We propose content-based recommender systems that extract elements learned from book texts to predict readers’ future interests. One factor that influences reading preferences is writing style; we propose a system that recommends books after learning their authors’ writing style. To our knowledge, this is the first work that transfers the information learned by an author-identification model to a book RS. Another approach that we propose uses over a hundred lexical, syntactic, stylometric, and fiction-based features that might play a role in generating high-quality book recommendations. Previous book RSs include very few stylometric features; hence, our study is the first to include and analyze a wide variety of textual elements for book recommendations. We evaluated both approaches according to a top-k recommendation scenario. They give better accuracy when compared with state-of-the-art content and collaborative filtering methods. We highlight the significant factors that contributed to the accuracy of the recommendations using a forest of randomized regression trees. We also conducted a qualitative analysis by checking if similar books/authors were annotated similarly by experts. Our content-based systems suffer from the new user problem, well-known in the field of RSs, that hinders their ability to make accurate recommendations. Therefore, we propose a Topic Model-Based book recommendation component (TMB) that addresses the issue by using the topics learned from a user’s shared text on social media, to recognize their interests and map them to related books. 
To our knowledge, there is no literature regarding book RSs that exploits public social networks other than book-cataloging websites. Using topic modeling techniques, extracting user interests can be automatic and dynamic, without the need to search for predefined concepts. Though TMB is designed to complement other systems, we evaluated it against a traditional book CB. We assessed the top k recommendations made by TMB and CB and found that both retrieved a comparable number of books, even though CB relied on users’ rating history, while TMB only required their social profiles.
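Matching a user's learned topic distribution to books, as the TMB component above does, can be illustrated with made-up topic vectors and cosine similarity (an editorial sketch; the topic proportions and titles are assumptions, not the thesis's pipeline):

```python
import numpy as np

def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical topic proportions learned from a user's social-media text
user_topics = [0.7, 0.2, 0.1]            # e.g. mostly "space exploration"

books = {
    "The Martian":         [0.80, 0.10, 0.10],
    "Pride and Prejudice": [0.05, 0.15, 0.80],
}

# Rank books by similarity of their topic vectors to the user's interests
ranked = sorted(books, key=lambda b: cosine(user_topics, books[b]), reverse=True)
print(ranked[0])  # The Martian
```

Because the interests come from shared text rather than a rating history, a component like this can recommend to users with no prior ratings, which is how it addresses the new user problem.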
APA, Harvard, Vancouver, ISO, and other styles
40

Medlock, Benjamin William. "Investigating classification for natural language processing tasks." Thesis, University of Cambridge, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.611949.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Huang, Yin Jou. "Event Centric Approaches in Natural Language Processing." Doctoral thesis, Kyoto University, 2021. http://hdl.handle.net/2433/265210.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Woldemariam, Yonas Demeke. "Natural language processing in cross-media analysis." Licentiate thesis, Umeå universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-147640.

Full text
Abstract:
A cross-media analysis framework is an integrated multi-modal platform where a media resource containing different types of data such as text, images, audio and video is analyzed with metadata extractors, working jointly to contextualize the media resource. It generally provides cross-media analysis and automatic annotation, metadata publication and storage, searches and recommendation services. For on-line content providers, such services allow them to semantically enhance a media resource with the extracted metadata representing the hidden meanings and make it more efficiently searchable. Within the architecture of such frameworks, Natural Language Processing (NLP) infrastructures cover a substantial part. The NLP infrastructures include text analysis components such as a parser, named entity extraction and linking, sentiment analysis and automatic speech recognition. Since NLP tools and techniques were originally designed to operate in isolation, integrating them in cross-media frameworks and analyzing textual data extracted from multimedia sources is very challenging. In particular, text extracted from audio-visual content lacks linguistic features that potentially provide important clues for text analysis components. Thus, there is a need to develop various techniques to meet the requirements and design principles of the frameworks. In our thesis, we explore developing various methods and models satisfying the text and speech analysis requirements posed by cross-media analysis frameworks. The developed methods allow the frameworks to extract linguistic knowledge of various types and predict various kinds of information, such as sentiment and competence. We also attempt to enhance the multilingualism of the frameworks by designing an analysis pipeline that includes speech recognition, transliteration and named entity recognition for Amharic, which also makes Amharic content on the web more efficiently accessible. 
The method can potentially be extended to support other under-resourced languages.
APA, Harvard, Vancouver, ISO, and other styles
43

Cheung, Siu-nang Bruce. "A concise framework of natural language processing /." [Hong Kong : University of Hong Kong], 1989. http://sunzi.lib.hku.hk/hkuto/record.jsp?B12432544.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Miao, Yishu. "Deep generative models for natural language processing." Thesis, University of Oxford, 2017. http://ora.ox.ac.uk/objects/uuid:e4e1f1f9-e507-4754-a0ab-0246f1e1e258.

Full text
Abstract:
Deep generative models are essential to Natural Language Processing (NLP) due to their outstanding ability to use unlabelled data, to incorporate abundant linguistic features, and to learn interpretable dependencies among data. As the structure becomes deeper and more complex, having an effective and efficient inference method becomes increasingly important. In this thesis, neural variational inference is applied to carry out inference for deep generative models. While traditional variational methods derive an analytic approximation for the intractable distributions over latent variables, here we construct an inference network conditioned on the discrete text input to provide the variational distribution. The powerful neural networks are able to approximate complicated non-linear distributions and grant the possibilities for more interesting and complicated generative models. Therefore, we develop the potential of neural variational inference and apply it to a variety of models for NLP with continuous or discrete latent variables. This thesis is divided into three parts. Part I introduces a generic variational inference framework for generative and conditional models of text. For continuous or discrete latent variables, we apply a continuous reparameterisation trick or the REINFORCE algorithm to build low-variance gradient estimators. To further explore Bayesian non-parametrics in deep neural networks, we propose a family of neural networks that parameterise categorical distributions with continuous latent variables. Using the stick-breaking construction, an unbounded categorical distribution is incorporated into our deep generative models which can be optimised by stochastic gradient back-propagation with a continuous reparameterisation. Part II explores continuous latent variable models for NLP. 
Chapter 3 discusses the Neural Variational Document Model (NVDM): an unsupervised generative model of text which aims to extract a continuous semantic latent variable for each document. In Chapter 4, the neural topic models modify the neural document models by parameterising categorical distributions with continuous latent variables, where the topics are explicitly modelled by discrete latent variables. The models are further extended to neural unbounded topic models with the help of stick-breaking construction, and a truncation-free variational inference method is proposed based on a Recurrent Stick-breaking construction (RSB). Chapter 5 describes the Neural Answer Selection Model (NASM) for learning a latent stochastic attention mechanism to model the semantics of question-answer pairs and predict their relatedness. Part III discusses discrete latent variable models. Chapter 6 introduces latent sentence compression models. The Auto-encoding Sentence Compression Model (ASC), as a discrete variational auto-encoder, generates a sentence by a sequence of discrete latent variables representing explicit words. The Forced Attention Sentence Compression Model (FSC) incorporates a combined pointer network biased towards the usage of words from source sentence, which significantly improves the performance when jointly trained with the ASC model in a semi-supervised learning fashion. Chapter 7 describes the Latent Intention Dialogue Models (LIDM) that employ a discrete latent variable to learn underlying dialogue intentions. Additionally, the latent intentions can be interpreted as actions guiding the generation of machine responses, which could be further refined autonomously by reinforcement learning. Finally, Chapter 8 summarizes our findings and directions for future work.
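The continuous reparameterisation trick that underpins the neural variational inference described in the abstract above can be shown in a few lines: instead of sampling z ~ N(mu, sigma²) directly, sample eps ~ N(0, 1) and set z = mu + sigma·eps, so z becomes a deterministic, differentiable function of (mu, sigma) through which gradients can flow. This is a generic sketch of the trick, not the thesis's models:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_sigma, n_samples):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.standard_normal(n_samples)
    return mu + np.exp(log_sigma) * eps

samples = reparameterize(mu=2.0, log_sigma=0.0, n_samples=100_000)
print(round(samples.mean(), 2), round(samples.std(), 2))  # close to 2.0 and 1.0
```

For discrete latent variables this trick does not apply directly, which is why the thesis turns to the REINFORCE score-function estimator in those cases.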
APA, Harvard, Vancouver, ISO, and other styles
45

Hu, Jin. "Explainable Deep Learning for Natural Language Processing." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254886.

Full text
Abstract:
Deep learning methods achieve impressive performance on many Natural Language Processing (NLP) tasks, but it remains difficult to know what happens inside a deep neural network. This thesis gives a general overview of Explainable AI and of how explainable deep learning methods are applied to NLP tasks. It then introduces the Bi-directional LSTM and CRF (BiLSTM-CRF) model for the Named Entity Recognition (NER) task, together with an approach for making this model explainable. An approach is proposed to visualise the importance of neurons in the BiLSTM layer of the model for NER via Layer-wise Relevance Propagation (LRP), which can measure how neurons contribute to each prediction of a word in a sequence. Ideas on how to measure the influence of the CRF layer of the BiLSTM-CRF model are also described.
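The LRP redistribution step described in the abstract can be illustrated on a single linear layer. This is a generic epsilon-rule sketch (plain lists, not the thesis's BiLSTM implementation): each input unit receives relevance in proportion to its contribution to each output activation, so the total relevance is approximately conserved as it propagates backwards:

```python
def lrp_linear(x, W, b, relevance_out, eps=1e-6):
    """Layer-wise Relevance Propagation (epsilon rule) through one linear
    layer y_j = sum_i W[j][i] * x[i] + b[j]: input i receives the share
    W[j][i] * x[i] / y_j of the relevance assigned to each output j."""
    n_out, n_in = len(W), len(x)
    z = [sum(W[j][i] * x[i] for i in range(n_in)) + b[j] for j in range(n_out)]
    rel_in = [0.0] * n_in
    for j in range(n_out):
        denom = z[j] + (eps if z[j] >= 0.0 else -eps)   # small stabiliser
        for i in range(n_in):
            rel_in[i] += W[j][i] * x[i] / denom * relevance_out[j]
    return rel_in
```

Applying this rule layer by layer, from the output prediction back through the network, yields per-neuron (and ultimately per-input-word) relevance scores of the kind the thesis visualises for the BiLSTM layer.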
APA, Harvard, Vancouver, ISO, and other styles
46

Guy, Alison. "Logical expressions in natural language conditionals." Thesis, University of Sunderland, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.278644.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Walker, Alden. "Natural language interaction with robots." Diss., Connect to the thesis, 2007. http://hdl.handle.net/10066/1275.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Fuchs, Gil Emanuel. "Practical natural language processing question answering using graphs /." Diss., Digital Dissertations Database. Restricted to UC campuses, 2004. http://uclibs.org/PID/11984.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Kolak, Okan. "Rapid resource transfer for multilingual natural language processing." College Park, Md. : University of Maryland, 2005. http://hdl.handle.net/1903/3182.

Full text
Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2005.
Thesis research directed by: Dept. of Linguistics. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
APA, Harvard, Vancouver, ISO, and other styles
50

Takeda, Koichi. "Building Natural Language Processing Applications Using Descriptive Models." 京都大学 (Kyoto University), 2010. http://hdl.handle.net/2433/120372.

Full text
APA, Harvard, Vancouver, ISO, and other styles
