Dissertations / Theses on the topic 'Clinical Natural Language Processing'

Author: Grafiati

Published: 25 May 2024

Consult the top 50 dissertations / theses for your research on the topic 'Clinical Natural Language Processing.'

1

Chien, Isabel. "Natural language processing for precision clinical diagnostics and treatment." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/119754.

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 61-65).
In this thesis, I focus on the application of natural language processing to clinical diagnostics and treatment within the palliative care and serious illness field. I explore a variety of natural language processing methods, including deep learning, rule-based, and classic machine learning approaches, applied to the identification of documentation reflecting advance care planning measures, serious illnesses, and serious illness symptoms. I introduce two tools that can be used to analyze clinical notes from electronic health records: ClinicalRegex, a regular expression interface, and PyCCI, a clinical text annotation tool. Additionally, I discuss a palliative care-focused research project in which I apply machine learning natural language processing methods to identifying clinical documentation in the palliative care and serious illness field. Advance care planning, which includes clarifying and documenting goals of care and preferences for future care, is essential for achieving end-of-life care that is consistent with the preferences of dying patients and their families. Physicians document their communication about these preferences as unstructured free text in clinical notes; as a result, routine assessment of this quality indicator is time consuming and costly. Integrating goals of care conversations and advance care planning into decision-making about palliative surgery has been shown to result in less invasive care near the time of death and to improve clinical outcomes for both the patient and surviving family members. Natural language processing methods offer an efficient and scalable way to improve the visibility of documented serious illness conversations within electronic health record data, helping to improve quality of care.
by Isabel Chien.
M. Eng.
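
The rule-based end of the methods named in this abstract can be illustrated with a small keyword search over note text. This is only a minimal sketch: the patterns, the function name `find_acp_mentions`, and the sample note are hypothetical and are not taken from ClinicalRegex or from the thesis.

```python
import re

# Hypothetical keyword patterns for advance care planning documentation.
# These are illustrative only; they are not ClinicalRegex's actual rules.
ACP_PATTERNS = [
    r"\bgoals of care\b",
    r"\badvanced? care planning\b",
    r"\bhealth ?care proxy\b",
]
ACP_REGEX = re.compile("|".join(ACP_PATTERNS), flags=re.IGNORECASE)


def find_acp_mentions(note):
    """Return every matched phrase in the note, in order of appearance."""
    return [match.group(0) for match in ACP_REGEX.finditer(note)]


# Made-up note text for demonstration.
note = "Discussed goals of care with family. Health care proxy is daughter."
mentions = find_acp_mentions(note)
print(mentions)
```

A keyword list like this is cheap to audit, which is one reason rule-based matching is often used as a baseline next to learned models.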

2

Mehrabi, Saeed. "Advanced natural language processing and temporal mining for clinical discovery." Thesis, Indiana University - Purdue University Indianapolis, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10032405.

Abstract:

There has been a vast and growing amount of healthcare data, especially with the rapid adoption of electronic health records (EHRs) as a result of the HITECH Act of 2009. It is estimated that around 80% of the clinical information resides in the unstructured narrative of an EHR. Recently, natural language processing (NLP) techniques have offered opportunities to extract information from unstructured clinical texts needed for various clinical applications. A popular method for enabling secondary uses of EHRs is information or concept extraction, a subtask of NLP that seeks to locate and classify elements within text based on the context. Extraction of clinical concepts without considering the context has many complications, including inaccurate diagnosis of patients and contamination of study cohorts. Identifying the negation status and whether a clinical concept belongs to the patient or to family members are two of the challenges faced in context detection. A negation algorithm called Dependency Parser Negation (DEEPEN) has been developed in this research study by taking into account the dependency relationship between negation words and concepts within a sentence using the Stanford Dependency Parser. The study results demonstrate that DEEPEN can reduce the number of incorrect negation assignments for patients with positive findings, and therefore improve the identification of patients with the target clinical findings in EHRs. Additionally, an NLP system consisting of section segmentation and relation discovery was developed to identify patients’ family history. To assess the generalizability of the negation and family history algorithms, data from a different clinical institution was used in both algorithm evaluations. The temporal dimension of extracted information from clinical records representing the trajectory of disease progression in patients was also studied in this project.
Clinical data of patients who lived in Olmsted County (Rochester, MN) from 1966 to 2010 was analyzed in this work. The patient records were modeled by diagnosis matrices with clinical events as rows and their temporal information as columns. A deep learning algorithm was used to find common temporal patterns within these diagnosis matrices.
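
The diagnosis matrix described here (clinical events as rows, time as columns) can be sketched as follows. The yearly binning, the binary 0/1 encoding, the function name `build_diagnosis_matrix`, and the sample records are all assumptions made for illustration; the abstract does not specify how the matrices were actually encoded.

```python
def build_diagnosis_matrix(records, events, years):
    """Build a binary matrix with one row per event and one column per year.

    records: iterable of (event, year) pairs for a single patient.
    Assumption: a cell is 1 if the event was recorded in that year, else 0.
    """
    row_index = {event: i for i, event in enumerate(events)}
    col_index = {year: j for j, year in enumerate(years)}
    matrix = [[0] * len(years) for _ in events]
    for event, year in records:
        # Ignore events or years outside the chosen vocabulary and time range.
        if event in row_index and year in col_index:
            matrix[row_index[event]][col_index[year]] = 1
    return matrix


# Made-up patient history for demonstration.
events = ["diabetes", "hypertension"]
years = [2008, 2009, 2010]
records = [("diabetes", 2008), ("hypertension", 2010), ("diabetes", 2010)]
matrix = build_diagnosis_matrix(records, events, years)
for event, row in zip(events, matrix):
    print(event, row)
```

A fixed-shape matrix per patient is convenient because it can be passed to a pattern-finding model in the same way as an image.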

3

Forsyth, Alexander William. "Improving clinical decision making with natural language processing and machine learning." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/112847.

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 49-53).
This thesis focused on two tasks of applying natural language processing (NLP) and machine learning to electronic health records (EHRs) to improve clinical decision making. The first task was to predict cardiac resynchronization therapy (CRT) outcomes with better precision than the current physician guidelines for recommending the procedure. We combined NLP features from free-text physician notes with structured data to train a supervised classifier to predict CRT outcomes. While our results gave a slight improvement over the current baseline, we were not able to predict CRT outcome with both high precision and high recall. These results limit the clinical applicability of our model, and reinforce previous work, which also could not find accurate predictors of CRT response. The second task in this thesis was to extract breast cancer patient symptoms during chemotherapy from free-text physician notes. We manually annotated about 10,000 sentences, and trained a conditional random field (CRF) model to predict whether a word indicated a symptom (positive label), specifically indicated the absence of a symptom (negative label), or was neutral. Our final model achieved 0.66, 1.00, and 0.77 F1 scores for predicting positive, neutral, and negative labels respectively. While the F1 scores for positive and negative labels are not extremely high, with the current performance, our model could be applied, for example, to gather better statistics about what symptoms breast cancer patients experience during chemotherapy and at what time points during treatment they experience these symptoms.
by Alexander William Forsyth.
M. Eng.
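
For readers less familiar with the per-label F1 scores quoted in this abstract, the metric can be sketched as below. The counts in the usage example are invented for illustration and are not results from the thesis.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for a single label."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: 6 correct symptom words, 2 spurious, 4 missed.
example_f1 = f1_score(true_positives=6, false_positives=2, false_negatives=4)
print(round(example_f1, 3))
```

Because F1 is a harmonic mean, a label needs both reasonable precision and reasonable recall to score well, which is why it is reported separately for the positive, neutral, and negative labels.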

4

Regulapati, Sushmitha. "Natural language processing framework to assist in the evaluation of adherence to clinical guidelines." Morgantown, W. Va. : [West Virginia University Libraries], 2007. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=5340.

Abstract:

Thesis (M.S.)--West Virginia University, 2007.
Title from document title page. Document formatted into pages; contains vii, 36 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 33-36).

5

Leonhard, Annette Christa. "Automated question answering for clinical comparison questions." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/6266.

Abstract:

This thesis describes the development and evaluation of new automated Question Answering (QA) methods tailored to clinical comparison questions that give clinicians a rank-ordered list of MEDLINE® abstracts targeted to natural language clinical drug comparison questions (e.g. “Have any studies directly compared the effects of Pioglitazone and Rosiglitazone on the liver?”). Three corpora were created to develop and evaluate a new QA system for clinical comparison questions called RetroRank. RetroRank takes the clinician’s plain text question as input, processes it and outputs a rank-ordered list of potential answer candidates, i.e. MEDLINE® abstracts, that is reordered using new post-retrieval ranking strategies to ensure the most topically-relevant abstracts are displayed as high in the result set as possible. RetroRank achieves a significant improvement over the PubMed recency baseline and performs equal to or better than previous approaches to post-retrieval ranking relying on query frames and annotated data such as the approach by Demner-Fushman and Lin (2007). The performance of RetroRank shows that it is possible to successfully use natural language input and a fully automated approach to obtain answers to clinical drug comparison questions. This thesis also introduces two new evaluation corpora of clinical comparison questions with “gold standard” references that are freely available and are a valuable resource for future research in medical QA.

6

Eglowski, Skylar. "CREATE: Clinical Record Analysis Technology Ensemble." DigitalCommons@CalPoly, 2017. https://digitalcommons.calpoly.edu/theses/1771.

Abstract:

In this thesis, we describe an approach that won a psychiatric symptom severity prediction challenge. The challenge was to correctly predict the severity of psychiatric symptoms on a 4-point scale. Our winning submission uses a novel stacked machine learning architecture in which (i) a base data ingestion/cleaning step was followed by the (ii) derivation of a base set of features defined using text analytics, after which (iii) association rule learning was used in a novel way to generate new features, followed by a (iv) feature selection step to eliminate irrelevant features, followed by a (v) classifier training algorithm in which a total of 22 classifiers including new classifier variants of AdaBoost and RandomForest were trained on seven different data views, and (vi) finally an ensemble learning step, in which ensembles of best learners were used to improve on the accuracy of individual learners. All of this was tested via standard 10-fold cross-validation on training data provided by the N-GRID challenge organizers, of which the three best ensembles were selected for submission to N-GRID's blind testing. The best of our submitted solutions garnered an overall final score of 0.863 according to the organizer's measure. All 3 of our submissions placed within the top 10 out of the 65 total submissions. The challenge constituted Track 2 of the 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDOC Individualized Domains (N-GRID) Shared Task in Clinical Natural Language Processing.

7

Wang, Yefeng. "Information extraction from clinical notes." Thesis, The University of Sydney, 2010. https://hdl.handle.net/2123/28844.

Abstract:

Information Extraction (IE) is an important task for Natural Language Processing (NLP). Effective IE methods, aimed at constructing structured information from unstructured natural language text, can save a large amount of human effort in processing the digital information available today. Successful application of IE to the clinical domain can advance clinical research and provide underlying techniques to support better health information systems. This thesis investigates the problems of IE from clinical notes.

8

Henriksson, Aron. "Semantic Spaces of Clinical Text : Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records." Licentiate thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-94344.

Abstract:

The large amounts of clinical data generated by electronic health record systems are an underutilized resource, which, if tapped, has enormous potential to improve health care. Since the majority of this data is in the form of unstructured text, which is challenging to analyze computationally, there is a need for sophisticated clinical language processing methods. Unsupervised methods that exploit statistical properties of the data are particularly valuable due to the limited availability of annotated corpora in the clinical domain. Information extraction and natural language processing systems need to incorporate some knowledge of semantics. One approach exploits the distributional properties of language – more specifically, term co-occurrence information – to model the relative meaning of terms in high-dimensional vector space. Such methods have been used with success in a number of general language processing tasks; however, their application in the clinical domain has previously only been explored to a limited extent. By applying models of distributional semantics to clinical text, semantic spaces can be constructed in a completely unsupervised fashion. Semantic spaces of clinical text can then be utilized in a number of medically relevant applications. The application of distributional semantics in the clinical domain is here demonstrated in three use cases: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions. To apply distributional semantics effectively to a wide range of both general and, in particular, clinical language processing tasks, certain limitations or challenges need to be addressed, such as how to model the meaning of multiword terms and account for the function of negation: a simple means of incorporating paraphrasing and negation in a distributional semantic framework is here proposed and evaluated. 
The notion of ensembles of semantic spaces is also introduced; these are shown to outperform the use of a single semantic space on the synonym extraction task. This idea allows different models of distributional semantics, with different parameter configurations and induced from different corpora, to be combined. This is not least important in the clinical domain, as it allows potentially limited amounts of clinical data to be supplemented with data from other, more readily available sources. The importance of configuring the dimensionality of semantic spaces, particularly when – as is typically the case in the clinical domain – the vocabulary grows large, is also demonstrated.
High-Performance Data Mining for Drug Effect Detection (DADEL)
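
The idea of modelling term meaning from co-occurrence information, as described in this abstract, can be sketched with raw context counts and cosine similarity. This is a deliberately simplified illustration: the window size, the toy corpus, and the function names are assumptions, and the thesis itself works with high-dimensional semantic spaces rather than raw counts like these.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each term to a Counter of the terms seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (invented): two terms used in identical contexts, one in a different context.
corpus = [
    "patient reports severe headache today".split(),
    "patient reports severe cephalalgia today".split(),
    "patient denies fever today".split(),
]
vectors = cooccurrence_vectors(corpus)
sim_synonyms = cosine(vectors["headache"], vectors["cephalalgia"])
sim_other = cosine(vectors["headache"], vectors["fever"])
print(sim_synonyms > sim_other)
```

Terms that appear in similar contexts end up with similar vectors, which is the property that synonym extraction relies on; no labelled examples are needed.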

9

Khizra, Shufa. "Using Natural Language Processing and Machine Learning for Analyzing Clinical Notes in Sickle Cell Disease Patients." Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright154759374321405.

10

Islam, Mohammed Ashrafull. "Enhancing the interactivity of a clinical decision support system by using knowledge engineering and natural language processing." Thesis, Aston University, 2018. http://publications.aston.ac.uk/37540/.

Abstract:

Mental illness is a serious health problem and it affects many people. Increasingly, Clinical Decision Support Systems (CDSS) are being used for diagnosis and it is important to improve the reliability and performance of these systems. Missing a potential clue or a wrong diagnosis can have a detrimental effect on the patient's quality of life and could lead to a fatal outcome. The context of this research is the Galatean Risk and Safety Tool (GRiST), a mental-health-risk assessment system. Previous research has shown that the success of a CDSS depends on its ease of use, reliability and interactivity. This research addresses these concerns for GRiST by deploying data mining techniques. Clinical narratives and numerical data have both been analysed for this purpose. Clinical narratives have been processed by natural language processing (NLP) technology to extract knowledge from them. SNOMED-CT was used as a reference ontology and the performance of the different extraction algorithms has been compared. A new Ensemble Concept Mining (ECM) method has been proposed, which may eliminate the need for domain-specific phrase annotation. Word embedding has been used to filter phrases semantically and to build a semantic representation of each of the GRiST ontology nodes. The Chi-square and FP-growth methods have been used to find relationships between GRiST ontology nodes. Interesting patterns have been found that could be used to provide real-time feedback to clinicians. Information gain has been used efficaciously to explain the differences between the clinicians and the consensus risk. A new risk management strategy has been explored by analysing repeat assessments. A few novel methods have been proposed to perform automatic background analysis of the patient data and improve the interactivity and reliability of GRiST and similar systems.

11

Alnazzawi, Noha Abdulkareem D. "Linking clinical records to the biomedical literature." Thesis, University of Manchester, 2016. https://www.research.manchester.ac.uk/portal/en/theses/linking-clinical-records-to-the-biomedical-literature(7ab62b2f-2178-49f3-9b5f-a8c15598cae7).html.

Abstract:

Narrative information in Electronic Health Records (EHRs) contains a wealth of clinical information about treatments, diagnosis, medication and family history. In addition, the scientific literature represents a rich source of information that summarises the latest results and new research findings relevant to different diseases. These two textual sources often contain different types of valuable phenotypic information that may be complementary to each other. Combining details from each source thus has the potential to be useful in uncovering new disease-phenotypic associations. In turn, these associations can help to identify patients with high risk factors, and they can be useful in developing solutions to control the causes responsible for the development of different diseases. However, clinicians at the point of care have limited time to review the large volume of potentially useful information that is locked away in unstructured text format. This in turn limits the utility of this “raw” information to clinical practitioners and computerised applications. Accordingly, the provision of automated and efficient means to extract, combine and present phenotype information that may be scattered amongst a large number of different textual sources in an easily digestible format is a prerequisite to the effective use and comprehensive understanding of details contained within both the records and the literature. The development of such facilities can in turn help in deriving information about disease correlations and supporting clinical decisions. This thesis is the first comprehensive study focussing on extracting and integrating phenotypic information from two different biomedical sources using Text Mining (TM) techniques. 
In this research, we describe our work on (1) extracting phenotypic information from both EHRs and the biomedical literature; (2) extracting the relations between phenotypic information and distilling them from EHRs using an event-based approach; and (3) using normalisation methods to link the phenotypic information found in EHRs with associated mentions found in the literature as a first step towards the automatic integration of information from these heterogeneous sources.

12

Dehghan, Azad. "Mining patient journeys from healthcare narratives." Thesis, University of Manchester, 2015. https://www.research.manchester.ac.uk/portal/en/theses/mining-patient-journeys-from-healthcare-narratives(69ebfa6d-764a-4dfe-bbf8-6aab1905a6f3).html.

Abstract:

The aim of the thesis is to investigate the feasibility of using text mining methods to reconstruct patient journeys from unstructured clinical narratives. A novel method to extract and represent patient journeys is proposed and evaluated in this thesis. A composition of methods was designed, developed and evaluated to this end, which included health-related concept extraction, temporal information extraction, and concept clustering and automated work-flow generation. A suite of methods to extract clinical information from healthcare narratives was proposed and evaluated in order to enable chronological ordering of clinical concepts. Specifically, we proposed and evaluated a data-driven method to identify key clinical events (i.e., medical problems, treatments, and tests) using a sequence labelling algorithm, CRF, with a combination of lexical and syntactic features, and a rule-based post-processing method including label correction, boundary adjustment and a false positive filter. The method was evaluated as part of the 2012 i2b2 challenge and achieved state-of-the-art performance with a strict and lenient micro F1-measure of 83.45% and 91.13% respectively. A method to extract temporal expressions using a hybrid knowledge-driven (dictionary and rules) and data-driven (CRF) approach has been proposed and evaluated. The method demonstrated state-of-the-art performance at the 2012 i2b2 challenge: F1-measure of 90.48% and accuracy of 70.44% for identification and normalisation respectively. For temporal ordering of events we proposed and evaluated a knowledge-driven method, with an F1-measure of 62.96% (considering the reduced temporal graph) or 70.22% for extraction of temporal links.
The method developed consisted of initial rule-based identification and classification components which utilised contextual lexico-syntactic cues for inter-sentence links, string similarity for co-reference links, and subsequently a temporal closure component to calculate transitive relations of the extracted links. In a case study of survivors of childhood central nervous system tumours (medulloblastoma), qualitative evaluation showed that we were able to capture specific trends that are part of patient journeys. An overall quantitative evaluation score (average precision and recall) of 94-100% for individual and 97% for aggregated patient journeys was also achieved. This indicates that text mining methods can be used to identify, extract and temporally organise key clinical concepts that make up a patient’s journey. We also presented an analysis of healthcare narratives, specifically exploring the content of clinical and patient narratives by using methods developed to extract patient journeys. We found that health-related quality of life concepts are more common in patient narratives, while clinical concepts (e.g., medical problems, treatments, tests) are more prevalent in clinical narratives. In addition, while both aggregated sets of narratives contain all investigated concepts, clinical narratives contain, proportionally, more health-related quality of life concepts than clinical concepts found in patient narratives. These results demonstrate that automated concept extraction, in particular health-related quality of life, as part of standard clinical practice is feasible. The proposed method presented herein demonstrated that text mining methods can be efficiently used to identify, extract and temporally organise key clinical concepts that make up a patient’s journey in a healthcare system.
Automated reconstruction of patient journeys can potentially be of value for clinical practitioners and researchers, to aid large scale analyses of implemented care pathways, and subsequently help monitor, compare, develop and adjust clinical guidelines both in the areas of chronic diseases where there is plenty of data and rare conditions where potentially there are no established guidelines.
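
The temporal closure component mentioned in this abstract computes transitive relations over extracted links. A minimal sketch of that idea, restricted to a single BEFORE relation, is shown below. The function name, the restriction to one relation type, and the example events are assumptions for illustration; the thesis does not describe its implementation at this level of detail.

```python
def temporal_closure(before_links):
    """Return the transitive closure of a set of (earlier, later) event pairs.

    If A is before B and B is before C, the pair (A, C) is added, and the
    process repeats until no new pair can be derived.
    """
    closure = set(before_links)
    changed = True
    while changed:
        changed = False
        for first, middle in list(closure):
            for start, last in list(closure):
                if middle == start and (first, last) not in closure:
                    closure.add((first, last))
                    changed = True
    return closure


# Invented example: two extracted links imply a third.
links = {("admission", "surgery"), ("surgery", "discharge")}
closed = temporal_closure(links)
print(sorted(closed))
```

Closing the graph this way lets a system order events that were never mentioned in the same sentence, as long as a chain of extracted links connects them.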

13

Shivade, Chaitanya P. "How sick are you? Methods for extracting textual evidence to expedite clinical trial screening." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1462810822.

14

Tang, Huaxiu. "Detecting Adverse Drug Reactions in Electronic Health Records by using the Food and Drug Administration’s Adverse Event Reporting System." University of Cincinnati / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1470753258.

15

Velupillai, Sumithra. "Shades of Certainty : Annotation and Classification of Swedish Medical Records." Doctoral thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-74828.

Abstract:

Access to information is fundamental in health care. This thesis presents research on Swedish medical records with the overall goal of building intelligent information access tools that can aid health personnel, researchers and other professions in their daily work, and, ultimately, improve health care in general. The issue of ethics and identifiable information is addressed by creating an annotated gold standard corpus and porting an existing de-identification system to Swedish from English. The aim is to move towards making textual resources available to researchers without risking exposure of patients’ confidential information. Results for the rule-based system are not encouraging, but results for the gold standard are fairly high. Affirmed, uncertain and negated information needs to be distinguished when building accurate information extraction tools. Annotation models are created, with the aim of building automated systems. One model distinguishes certain and uncertain sentences, and is applied on medical records from several clinical departments. In a second model, two polarities and three levels of certainty are applied on diagnostic statements from an emergency department. Overall results are promising. Differences are seen depending on clinical practice, annotation task and level of domain expertise among the annotators. Using annotated resources for automatic classification is studied. Encouraging overall results using local context information are obtained. The fine-grained certainty levels are used for building classifiers for real-world e-health scenarios. This thesis contributes two annotation models of certainty and one of identifiable information, applied on Swedish medical records. A deeper understanding of the language use linked to conveying certainty levels is gained. Three annotated resources that can be used for further research have been created, and implications for automated systems are presented.

16

Bustos, Aurelia. "Extraction of medical knowledge from clinical reports and chest x-rays using machine learning techniques." Doctoral thesis, Universidad de Alicante, 2019. http://hdl.handle.net/10045/102193.

Abstract:

This thesis addresses the extraction of medical knowledge from clinical text using deep learning techniques. In particular, the proposed methods focus on cancer clinical trial protocols and chest x-ray reports. The main results are a proof of concept of the capability of machine learning methods to discern which statements are regarded as inclusion or exclusion criteria in short free-text clinical notes, and a large-scale chest x-ray image dataset labeled with radiological findings, diagnoses and anatomic locations. Clinical trials provide the evidence needed to determine the safety and effectiveness of new medical treatments. These trials are the basis employed for clinical practice guidelines and greatly assist clinicians in their daily practice when making decisions regarding treatment. However, the eligibility criteria used in oncology trials are too restrictive. Patients are often excluded on the basis of comorbidity, past or concomitant treatments and the fact they are over a certain age, and those patients that are selected do not, therefore, mimic clinical practice. This signifies that the results obtained in clinical trials cannot be extrapolated to patients if their clinical profiles were excluded from the clinical trial protocols. The efficacy and safety of new treatments for patients with these characteristics are not, therefore, defined. Given the clinical characteristics of particular patients, their type of cancer and the intended treatment, discovering whether or not they are represented in the corpus of available clinical trials requires the manual review of numerous eligibility criteria, which is impracticable for clinicians on a daily basis. In this thesis, a large medical corpus comprising all cancer clinical trial protocols published by competent authorities in the last 18 years was used to extract medical knowledge in order to help automatically learn patient eligibility in these trials.
For this, a model is built to automatically predict whether short clinical statements were considered inclusion or exclusion criteria. A method based on deep neural networks is trained on a dataset of 6 million short free-texts to classify them as eligible or not eligible. Pretrained word embeddings were used as inputs in order to predict whether or not short free-text statements describing clinical information were considered eligible. The semantic reasoning of the word-embedding representations obtained was also analyzed, making it possible to identify equivalent treatments for a type of tumor by analogy with the drugs used to treat other tumors. Results show that representation learning using deep neural networks can be successfully leveraged to extract the medical knowledge from clinical trial protocols and potentially assist practitioners when prescribing treatments. The second main task addressed in this thesis is related to knowledge extraction from medical reports associated with radiographs. Conventional radiology remains the most performed technique in radiodiagnosis services, with a percentage close to 75% (Radiología Médica, 2010). In particular, chest x-ray is the most common medical imaging exam, with over 35 million taken every year in the US alone (Kamel et al., 2017). Chest x-rays allow for inexpensive screening of several pathologies including masses, pulmonary nodules, effusions, cardiac abnormalities and pneumothorax. For this task, all the chest x-rays that had been interpreted and reported by radiologists at the Hospital Universitario de San Juan (Alicante) from Jan 2009 to Dec 2017 were used to build a novel large-scale dataset in which each high-resolution radiograph is labeled with its corresponding metadata, radiological findings and pathologies.
This dataset, named PadChest, includes more than 160,000 images obtained from 67,000 patients, covering six different position views and additional information on image acquisition and patient demography. The free-text reports written in Spanish by radiologists were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. For this, a subset of the reports (27%) was manually annotated by trained physicians, whereas the remaining set was automatically labeled with deep supervised learning methods that use attention mechanisms and take the text reports as input. The labels generated were then validated on an independent test set, achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is one of the largest public chest x-ray databases suitable for training supervised models concerning radiographs, and also the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded on request from http://bimcv.cipf.es/bimcv-projects/padchest/. PadChest is intended for training image classifiers based on deep learning techniques to extract medical knowledge from chest x-rays. It is essential that automatic radiology reporting methods can be integrated in a clinically validated manner into radiologists’ workflow in order to help specialists improve their efficiency and enable safer and actionable reporting. Computer vision methods capable of identifying the large spectrum of thoracic abnormalities (and also normality) need to be trained on large-scale, comprehensively labeled x-ray datasets such as PadChest. The development of these computer vision tools, once clinically validated, could serve to fulfill a broad range of unmet needs.
Beyond implementing and obtaining results for both clinical trials and chest x-rays, this thesis studies the nature of health data, the novelty of applying deep learning methods to obtain large-scale labeled medical datasets, and the relevance of its applications in medical research, which have contributed to its extramural diffusion and worldwide reach. This thesis describes this journey so that the reader is guided across multiple disciplines, from engineering to medicine and on to ethical considerations in artificial intelligence applied to medicine.
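
The Micro-F1 score reported in this abstract pools true positives, false positives and false negatives over all reports and labels before computing precision and recall. The following is a minimal sketch of that calculation, not the thesis's evaluation code; the function name `micro_f1` and the toy label sets are hypothetical and are not taken from PadChest.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label data: pool counts over all items."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the gold set
        fp += len(p - g)   # labels predicted but absent from the gold set
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy reports, each with a set of finding labels.
gold = [{"cardiomegaly", "pleural effusion"}, {"normal"}, {"nodule"}]
pred = [{"cardiomegaly"}, {"normal"}, {"nodule", "pleural effusion"}]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.2f}")
```

Because the counts are pooled, frequent labels weigh more than rare ones, which is one reason micro-averaging is commonly reported for label sets with a long tail.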

17

Kempf, Emmanuelle. "Structuration, standardisation et enrichissement par traitement automatique du langage des données relatives au cancer au sein de l’entrepôt de données de santé de l’Assistance Publique – Hôpitaux de Paris." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS694.

Abstract:

Le cancer est un enjeu de santé publique dont l’amélioration de la prise en charge repose, entre autres leviers, sur l’exploitation d’entrepôts de données de santé (EDS). Leur utilisation implique la maîtrise d’obstacles tels que la qualité, la standardisation et la structuration des données de soins qui y sont stockées. L’objectif de cette thèse était de démontrer qu’il est possible de lever les verrous d’utilisation secondaire des données de l’EDS de l’Assistance Publique - Hôpitaux de Paris (AP-HP) concernant des patients atteints de cancer à diverses finalités telles que le pilotage de la sécurité et de la qualité des soins, et les projets de recherche clinique observationnelle et expérimentale. En premier lieu, l’identification d’un jeu de données minimales a permis de concentrer l’effort de formalisation des items d’intérêt propres à la discipline. A partir de 15 items identifiés, 4 cas d’usages relevant de perspectives médicales distinctes ont pu être développés avec succès : pilotage concernant l’automatisation de calculs d’indicateurs de sécurité et de qualité des soins nécessaires à la certification internationale des établissements de santé, épidémiologie clinique concernant l’impact des mesures de santé publique en temps de pandémie sur le retard diagnostic des cancers, aide à la décision concernant l’optimisation du recrutement des patients dans des essais cliniques, développement de réseaux de neurones concernant la pronostication par vision par ordinateur. Une deuxième condition nécessaire à l’exploitation d’un EDS en oncologie repose sur la formalisation optimale et interopérable entre plusieurs EDS de ce jeu de données minimales. Dans le cadre de l’initiative française PENELOPE visant à améliorer le recrutement des patients dans des essais cliniques, la thèse a évalué la plus-value de l’extension oncologie du modèle de données commun OMOP. 
Cette version 5.4 d’OMOP permettait de doubler le taux de formalisation de critères de préscreening d’essais cliniques de phase I à IV. Seulement 23% de ces critères pouvaient être requetés automatiquement sur l’EDS de l’AP-HP, et ce, modulo une valeur prédictive positive inférieure à 30%. Ce travail propose une méthodologie inédite pour évaluer la performance d'un système d’aide au recrutement : à partir des métriques habituelles (sensibilité, spécificité, valeur prédictive positive, valeur prédictive négative), mais aussi à partir d’indicateurs complémentaires caractérisant l’adéquation du modèle choisi avec l’EDS concerné (taux de traduction et d'exécution des requêtes). Enfin, le travail a permis de montrer le caractère palliatif du traitement automatique du langage naturel concernant la structuration des données d'un EDS en informant le bilan d’extension initial d’un diagnostic de cancer et les caractéristiques histopronostiques des tumeurs. La confrontation des métriques de performance d’extraction textuelle et des ressources humaines et techniques nécessaires au développement de systèmes de règles et d’apprentissage automatique a permis de valoriser, pour un certain nombre de situations, la première approche. La thèse a identifié qu’une préannotation automatique à base de règles avant une phase d’annotation manuelle pour entraînement d’un modèle d’apprentissage machine était une approche optimisable. Les règles semblent suffire pour les tâches d’extraction textuelle d’une certaine typologie d’entités bien caractérisée sur un plan lexical et sémantique. L’anticipation et la modélisation de cette typologie pourrait être possible en amont de la phase d’extraction textuelle, afin de différencier, en fonction de chaque type d’entité, dans quelle mesure l’apprentissage machine devrait suppléer aux règles. 
La thèse a permis de démontrer qu’une attention portée à un certain nombre de thématiques des sciences des données permettait l’utilisation efficiente d’un EDS et ce, à des fins diverses en oncologie.
Cancer is a public health issue for which the improvement of care relies, among other levers, on the use of clinical data warehouses (CDWs). Their use involves overcoming obstacles such as the quality, standardization and structuring of the care data stored there. The objective of this thesis was to demonstrate that it is possible to address the challenges of secondary use of data from the Assistance Publique - Hôpitaux de Paris (AP-HP) CDW regarding cancer patients, for various purposes such as monitoring the safety and quality of care, and performing observational and experimental clinical research. First, the identification of a minimal data set made it possible to concentrate the effort of formalizing the items of interest specific to the discipline. From 15 identified items, 4 use cases from distinct medical perspectives were successfully developed: management, through automated calculation of the safety and quality-of-care indicators required for the international certification of health establishments; clinical epidemiology, regarding the impact of public health measures during a pandemic on the delay in cancer diagnosis; decision support, regarding the optimization of patient recruitment in clinical trials; and development of neural networks for prognostication by computer vision. A second condition necessary for the use of a CDW in oncology is the optimal and interoperable formalization of this minimal data set across several CDWs. As part of the French PENELOPE initiative aiming at improving patient recruitment in clinical trials, the thesis assessed the added value of the oncology extension of the OMOP common data model. This version 5.4 of OMOP doubled the rate of formalization of prescreening criteria for phase I to IV clinical trials. Only 23% of these criteria could be automatically queried on the AP-HP CDW, and with a positive predictive value of less than 30%.
This work proposed a novel methodology for evaluating the performance of a recruitment support system, based on the usual metrics (sensitivity, specificity, positive predictive value, negative predictive value) and also on additional indicators characterizing the fit between the chosen model and the CDW concerned (rates of query translation and execution). Finally, the work showed how natural language processing could help structure CDW data and enrich the minimal data set, by informing the baseline tumor dissemination assessment of a cancer diagnosis and the histoprognostic characteristics of tumors. The comparison of textual extraction performance metrics with the human and technical resources necessary for the development of rule-based and machine learning systems made it possible to favor, for a certain number of situations, the first approach. The thesis identified that automatic rule-based preannotation before a manual annotation phase for training a machine learning model was an optimizable approach. Rules seemed to be sufficient for textual extraction tasks involving a certain typology of entities that are well characterized on a lexical and semantic level. Anticipating and modeling this typology could be possible upstream of the textual extraction phase, in order to determine, for each type of entity, to what extent machine learning should replace the rules. The thesis demonstrated that close attention to a certain number of data science challenges allowed the efficient use of a CDW for various purposes in oncology.
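
The four usual metrics named in this abstract all derive from a single confusion matrix of screened patients. The sketch below shows the standard definitions only; the class `Confusion`, the function `screening_metrics` and the counts are hypothetical illustrations, not figures from the thesis.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Confusion:
    tp: int  # eligible patients flagged by the system
    fp: int  # ineligible patients flagged by the system
    tn: int  # ineligible patients correctly not flagged
    fn: int  # eligible patients the system missed


def screening_metrics(c: Confusion) -> Dict[str, float]:
    """Standard diagnostic-accuracy metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


# Hypothetical counts chosen only to illustrate a low positive predictive value.
counts = Confusion(tp=12, fp=36, tn=940, fn=12)
metrics = screening_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With rare eligible patients, specificity and negative predictive value can stay high while positive predictive value remains low, which is why the abstract reports it separately.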

18

Matsubara, Shigeki. "Corpus-based Natural Language Processing." INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2004. http://hdl.handle.net/2237/10355.

19

Smith, Sydney. "Approaches to Natural Language Processing." Scholarship @ Claremont, 2018. http://scholarship.claremont.edu/cmc_theses/1817.

Abstract:

This paper explores topic modeling through the example text of Alice in Wonderland. It explores both singular value decomposition (SVD) and non-negative matrix factorization (NMF) as methods for feature extraction. The paper goes on to explore methods for partially supervised implementation of topic modeling through introducing themes. A large portion of the paper also focuses on implementation of these techniques in Python as well as visualizations of the results, which use a combination of Python, HTML and JavaScript along with the D3 framework. The paper concludes by presenting a mixture of SVD, NMF and partially supervised NMF as a possible way to improve topic modeling.

20

Strandberg, Aron, and Patrik Karlström. "Processing Natural Language for the Spotify API : Are sophisticated natural language processing algorithms necessary when processing language in a limited scope?" Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186867.

Abstract:

Knowing whether you can implement something complex in a simple way in your application is always of interest. A natural language interface is something that could theoretically be implemented in a lot of applications, but the complexity of most natural language processing algorithms is a limiting factor. The problem explored in this paper is whether a simpler algorithm that doesn’t make use of convoluted statistical models and machine learning can be good enough. We implemented two algorithms, one utilizing Spotify’s own search and one with a more accurate, offline search. With the best precision we could muster being 81% at an average of 2.28 seconds per query, this is not a viable solution for a complete and satisfactory user experience. Further work could push the performance into an acceptable range.

21

Chen, Joseph C. H. "Quantum computation and natural language processing." [S.l.] : [s.n.], 2002. http://deposit.ddb.de/cgi-bin/dokserv?idn=965581020.

22

Knight, Sylvia Frances. "Natural language processing for aerospace documentation." Thesis, University of Cambridge, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.621395.

23

Naphtal, Rachael (Rachael M. ). "Natural language processing based nutritional application." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/100640.

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 67-68).
The ability to accurately and efficiently track nutritional intake is a powerful tool in combating obesity and other food-related diseases. Currently, many methods used for this task are time consuming or easily abandoned; however, a natural language based application that converts spoken text to nutritional information could be a convenient and effective solution. This thesis describes the creation of an application that translates spoken food diaries into nutritional database entries. It explores different methods for solving the problem of converting brands, descriptions and food item names into entries in nutritional databases. Specifically, we constructed a cache of over 4,000 food items, and also created a variety of methods to allow refinement of database mappings. We also explored methods of dealing with ambiguous quantity descriptions and the mapping of spoken quantity values to numerical units. When assessed by 500 users entering their daily meals on Amazon Mechanical Turk, the system was able to map 83.8% of the correctly interpreted spoken food items to relevant nutritional database entries. It was also able to find a logical quantity for 92.2% of the correct food entries. Overall, this system shows a significant step towards the intelligent conversion of spoken food diaries to actual nutritional feedback.
by Rachael Naphtal.
M. Eng.

24

Eriksson, Simon. "COMPARING NATURAL LANGUAGE PROCESSING TO STRUCTURED QUERY LANGUAGE ALGORITHMS." Thesis, Umeå universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163310.

Abstract:

Using natural language processing to create Structured Query Language (SQL) queries has many benefits in theory. Even though SQL is an expressive and powerful language, it requires certain technical knowledge to use. An interface effectively utilizing natural language processing would instead allow the user to communicate with the SQL database as if they were communicating with another human being. In this paper I compare how well two of the currently most advanced open source algorithms in this field (TypeSQL and SyntaxSQL) can understand advanced SQL. I show that SyntaxSQL is significantly more accurate but makes some sacrifices in execution time compared to TypeSQL.

25

Kesarwani, Vaibhav. "Automatic Poetry Classification Using Natural Language Processing." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37309.

Abstract:

Poetry, as a special form of literature, is crucial for computational linguistics. It has a high density of emotions, figures of speech, vividness, creativity, and ambiguity. Poetry poses a much greater challenge for the application of Natural Language Processing algorithms than any other literary genre. Our system establishes a computational model that classifies poems based on similarity features like rhyme, diction, and metaphor. For rhyme analysis, we investigate the methods used to classify poems based on rhyme patterns. First, the overview of different types of rhymes is given along with the detailed description of detecting rhyme type and sub-types by the application of a pronunciation dictionary on our poetry dataset. We achieve an accuracy of 96.51% in identifying rhymes in poetry by applying a phonetic similarity model. Then we achieve a rhyme quantification metric RhymeScore based on the matching phonetic transcription of each poem. We also develop an application for the visualization of this quantified RhymeScore as a scatter plot in 2 or 3 dimensions. For diction analysis, we investigate the methods used to classify poems based on diction. First the linguistic quantitative and semantic features that constitute diction are enumerated. Then we investigate the methodology used to compute these features from our poetry dataset. We also build a word embeddings model on our poetry dataset with 1.5 million words in 100 dimensions and do a comparative analysis with GloVe embeddings. Metaphor is a part of diction, but as it is a very complex topic in its own right, we address it as a stand-alone issue and develop several methods for it. Previous work on metaphor detection relies on either rule-based or statistical models, none of them applied to poetry. Our methods focus on metaphor detection in a poetry corpus, but we test on non-poetry data as well. We combine rule-based and statistical models (word embeddings) to develop a new classification system. 
Our first metaphor detection method achieves a precision of 0.759 and a recall of 0.804 in identifying one type of metaphor in poetry, by using a Support Vector Machine classifier with various types of features. Furthermore, our deep learning model based on a Convolutional Neural Network achieves a precision of 0.831 and a recall of 0.836 for the same task. We also develop an application for generic metaphor detection in any type of natural text.
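
The precision and recall pairs quoted in this abstract can be combined into a single F1 score, the harmonic mean of the two. The abstract does not report F1 itself, so the values computed below (roughly 0.78 for the SVM and 0.83 for the CNN) are derived here for illustration only; the function name `f1_score` is an assumption of this sketch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract; F1 is derived, not reported.
svm_f1 = f1_score(0.759, 0.804)
cnn_f1 = f1_score(0.831, 0.836)
print(f"SVM F1: {svm_f1:.3f}, CNN F1: {cnn_f1:.3f}")
```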

26

Pham, Son Bao Computer Science & Engineering Faculty of Engineering UNSW. "Incremental knowledge acquisition for natural language processing." Awarded by: University of New South Wales. School of Computer Science and Engineering, 2006. http://handle.unsw.edu.au/1959.4/26299.

Abstract:

Linguistic patterns have been used widely in shallow methods to develop numerous NLP applications. Approaches for acquiring linguistic patterns can be broadly categorised into three groups: supervised learning, unsupervised learning and manual methods. In supervised learning approaches, a large annotated training corpus is required for the learning algorithms to achieve decent results. However, annotated corpora are expensive to obtain and usually available only for established tasks. Unsupervised learning approaches usually start with a few seed examples and gather some statistics based on a large unannotated corpus to detect new examples that are similar to the seed ones. Most of these approaches either populate lexicons for predefined patterns or learn new patterns for extracting general factual information; hence they are applicable to only a limited number of tasks. Manually creating linguistic patterns has the advantage of utilising an expert's knowledge to overcome the scarcity of annotated data. In tasks with no annotated data available, the manual way seems to be the only choice. One typical problem that occurs with manual approaches is that the combination of multiple patterns, possibly being used at different stages of processing, often causes unintended side effects. Existing approaches, however, do not focus on the practical problem of acquiring those patterns but rather on how to use linguistic patterns for processing text. A systematic way to support the process of manually acquiring linguistic patterns in an efficient manner is long overdue. This thesis presents KAFTIE, an incremental knowledge acquisition framework that strongly supports experts in creating linguistic patterns manually for various NLP tasks. 
KAFTIE addresses difficulties in manually constructing knowledge bases of linguistic patterns, or rules in general, often faced in existing approaches by: (1) offering a systematic way to create new patterns while ensuring they are consistent; (2) alleviating the difficulty in choosing the right level of generality when creating a new pattern; (3) suggesting how existing patterns can be modified to improve the knowledge base's performance; (4) making the effort in creating a new pattern, or modifying an existing pattern, independent of the knowledge base's size. KAFTIE, therefore, makes it possible for experts to efficiently build large knowledge bases for complex tasks. This thesis also presents the KAFDIS framework for discourse processing using new representation formalisms: the level-of-detail tree and the discourse structure graph.

27

張少能 and Siu-nang Bruce Cheung. "A concise framework of natural language processing." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1989. http://hub.hku.hk/bib/B31208563.

28

Cahill, Lynne Julie. "Syllable-based morphology for natural language processing." Thesis, University of Sussex, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.386529.

Abstract:

This thesis addresses the problem of accounting for morphological alternation within Natural Language Processing. It proposes an approach to morphology which is based on phonological concepts, in particular the syllable, in contrast to morpheme-based approaches which have standardly been used by both NLP and linguistics. It is argued that morpheme-based approaches, within both linguistics and NLP, grew out of the apparently purely affixational morphology of European languages, and especially English, but are less appropriate for non-affixational languages such as Arabic. Indeed, it is claimed that even accounts of those European languages miss important linguistic generalizations by ignoring more phonologically based alternations, such as umlaut in German and ablaut in English. To justify this approach, we present a wide range of data from languages as diverse as German and Rotuman. A formal language, MOLUSC, is described, which allows for the definition of declarative mappings between syllable-sequences, and accounts of non-trivial fragments of the inflectional morphology of English, Arabic and Sanskrit are presented, to demonstrate the capabilities of the language. A semantics for the language is defined, and the implementation of an interpreter is described. The thesis discusses theoretical (linguistic) issues, as well as implementational issues involved in the incorporation of MOLUSC into a larger lexicon system. The approach is contrasted with previous work in computational morphology, in particular finite-state morphology, and its relation to other work in the fields of morphology and phonology is also discussed.

29

Lei, Tao Ph D. Massachusetts Institute of Technology. "Interpretable neural models for natural language processing." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/108990.

Abstract:

Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 109-119).
The success of neural network models often comes at a cost of interpretability. This thesis addresses the problem by providing justifications behind the model's structure and predictions. In the first part of this thesis, we present a class of sequence operations for text processing. The proposed component generalizes from convolution operations and gated aggregations. As justifications, we relate this component to string kernels, i.e. functions measuring the similarity between sequences, and demonstrate how it encodes the efficient kernel computing algorithm into its structure. The proposed model achieves state-of-the-art or competitive results compared to alternative architectures (such as LSTMs and CNNs) across several NLP applications. In the second part, we learn rationales behind the model's prediction by extracting input pieces as supporting evidence. Rationales are tailored to be short and coherent, yet sufficient for making the same prediction. Our approach combines two modular components, generator and encoder, which are trained to operate well together. The generator specifies a distribution over text fragments as candidate rationales and these are passed through the encoder for prediction. Rationales are never given during training. Instead, the model is regularized by the desiderata for rationales. We demonstrate the effectiveness of this learning framework in applications such as multi-aspect sentiment analysis. Our method achieves a performance of over 90% when evaluated against manually annotated rationales.
by Tao Lei.
Ph. D.

30

Grinman, Alex J. "Natural language processing on encrypted patient data." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/113438.

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 85-86).
While many industries can benefit from machine learning techniques for data analysis, they often do not have the technical expertise or computational power to do so. Therefore, many organizations would benefit from outsourcing their data analysis. Yet, stringent data privacy policies prevent outsourcing sensitive data and may stop the delegation of data analysis in its tracks. In this thesis, we put forth a two-party system where one party capable of powerful computation can run certain machine learning algorithms from the natural language processing domain on the second party's data, where the first party is limited to learning only specific functions of the second party's data and nothing else. Our system provides simple cryptographic schemes for locating keywords, matching approximate regular expressions, and computing frequency analysis on encrypted data. We present a full implementation of this system in the form of an extensible software library and a command line interface. Finally, we discuss a medical case study where we used our system to run a suite of unmodified machine learning algorithms on encrypted free-text patient notes.
by Alex J. Grinman.
M. Eng.

31

Alharthi, Haifa. "Natural Language Processing for Book Recommender Systems." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/39134.

Abstract:

The act of reading has benefits for individuals and societies, yet studies show that reading declines, especially among the young. Recommender systems (RSs) can help stop such decline. There is a lot of research regarding literary books using natural language processing (NLP) methods, but the analysis of textual book content to improve recommendations is relatively rare. We propose content-based recommender systems that extract elements learned from book texts to predict readers’ future interests. One factor that influences reading preferences is writing style; we propose a system that recommends books after learning their authors’ writing style. To our knowledge, this is the first work that transfers the information learned by an author-identification model to a book RS. Another approach that we propose uses over a hundred lexical, syntactic, stylometric, and fiction-based features that might play a role in generating high-quality book recommendations. Previous book RSs include very few stylometric features; hence, our study is the first to include and analyze a wide variety of textual elements for book recommendations. We evaluated both approaches according to a top-k recommendation scenario. They give better accuracy when compared with state-of-the-art content and collaborative filtering methods. We highlight the significant factors that contributed to the accuracy of the recommendations using a forest of randomized regression trees. We also conducted a qualitative analysis by checking if similar books/authors were annotated similarly by experts. Our content-based systems suffer from the new user problem, well-known in the field of RSs, that hinders their ability to make accurate recommendations. Therefore, we propose a Topic Model-Based book recommendation component (TMB) that addresses the issue by using the topics learned from a user’s shared text on social media, to recognize their interests and map them to related books. 
To our knowledge, there is no literature regarding book RSs that exploits public social networks other than book-cataloging websites. Using topic modeling techniques, extracting user interests can be automatic and dynamic, without the need to search for predefined concepts. Though TMB is designed to complement other systems, we evaluated it against a traditional book CB. We assessed the top k recommendations made by TMB and CB and found that both retrieved a comparable number of books, even though CB relied on users’ rating history, while TMB only required their social profiles.


32

Medlock, Benjamin William. "Investigating classification for natural language processing tasks." Thesis, University of Cambridge, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.611949.


33

Huang, Yin Jou. "Event Centric Approaches in Natural Language Processing." Doctoral thesis, Kyoto University, 2021. http://hdl.handle.net/2433/265210.


34

Woldemariam, Yonas Demeke. "Natural language processing in cross-media analysis." Licentiate thesis, Umeå universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-147640.


Abstract:

A cross-media analysis framework is an integrated multi-modal platform where a media resource containing different types of data such as text, images, audio and video is analyzed with metadata extractors, working jointly to contextualize the media resource. It generally provides cross-media analysis and automatic annotation, metadata publication and storage, and search and recommendation services. For on-line content providers, such services allow them to semantically enhance a media resource with the extracted metadata representing the hidden meanings and make it more efficiently searchable. Within the architecture of such frameworks, Natural Language Processing (NLP) infrastructures cover a substantial part. The NLP infrastructures include text analysis components such as a parser, named entity extraction and linking, sentiment analysis and automatic speech recognition. Since NLP tools and techniques are originally designed to operate in isolation, integrating them in cross-media frameworks and analyzing textual data extracted from multimedia sources is very challenging. In particular, the text extracted from audio-visual content lacks linguistic features that potentially provide important clues for text analysis components. Thus, there is a need to develop various techniques to meet the requirements and design principles of the frameworks. In this thesis, we explore developing various methods and models satisfying text and speech analysis requirements posed by cross-media analysis frameworks. The developed methods allow the frameworks to extract linguistic knowledge of various types and predict various information such as sentiment and competence. We also attempt to enhance the multilingualism of the frameworks by designing an analysis pipeline that includes speech recognition, transliteration and named entity recognition for Amharic, which also makes Amharic content on the web more accessible. The method can potentially be extended to support other under-resourced languages.


35

Cheung, Siu-nang Bruce. "A concise framework of natural language processing /." [Hong Kong : University of Hong Kong], 1989. http://sunzi.lib.hku.hk/hkuto/record.jsp?B12432544.


36

Dawborn, Timothy James. "DOCREP: Document Representation for Natural Language Processing." Thesis, The University of Sydney, 2015. http://hdl.handle.net/2123/14767.


Abstract:

The field of natural language processing (NLP) revolves around the computational interpretation and generation of natural language. The language typically processed in NLP occurs in paragraphs or documents rather than in single isolated sentences. Despite this, most NLP tools operate over one sentence at a time, not utilising the context outside of the sentence nor any of the metadata associated with the underlying document. One pragmatic reason for this disparity is that representing documents and their annotations through an NLP pipeline is difficult with existing infrastructure. Representing linguistic annotations for a text document using a plain text markup-based format is not sufficient to capture arbitrarily nested and overlapping annotations. Despite this, most linguistic text corpora and NLP tools still operate in this fashion. A document representation framework (DRF) supports the creation of linguistic annotations stored separately to the original document, overcoming this nesting and overlapping annotations problem. Despite the prevalence of pipelines in NLP, there is little published work on, or implementations of, DRFs. The main DRFs, GATE and UIMA, exhibit usability issues which have limited their uptake by the NLP community. This thesis aims to solve this problem through a novel, modern DRF, DOCREP; a portmanteau of document representation. DOCREP is designed to be efficient, programming language and environment agnostic, and most importantly, easy to use. We want DOCREP to be powerful and simple enough to use that NLP researchers and language technology application developers would even use it in their own small projects instead of developing their own ad hoc solution. This thesis begins by presenting the design criteria for our new DRF, extending upon existing requirements from the literature with additional usability and efficiency requirements that should lead to greater use of DRFs.
We outline how our new DRF, DOCREP, differs from existing DRFs in terms of the data model, serialisation strategy, developer interactions, support for rapid prototyping, and the expected runtime and environment requirements. We then describe our provided implementations of DOCREP in Python, C++, and Java, the most common languages in NLP; outlining their efficiency, idiomaticity, and the ways in which these implementations satisfy our design requirements. We then present two different evaluations of DOCREP. First, we evaluate its ability to model complex linguistic corpora through the conversion of the OntoNotes 5 corpus to DOCREP and UIMA, outlining the differences in modelling approaches required and efficiency when using these two DRFs. Second, we evaluate DOCREP against our usability requirements from the perspective of a computational linguist who is new to DOCREP. We walk through a number of common use cases for working with text corpora and contrast traditional approaches against their DOCREP counterparts. These two evaluations conclude that DOCREP satisfies our outlined design requirements and outperforms existing DRFs in terms of efficiency, and most importantly, usability. With DOCREP designed and evaluated, we then show how NLP applications can harness document structure. We present a novel document structure-aware tokenization framework for the first stage of full-stack NLP systems. We then present a new structure-aware NER system which achieves state-of-the-art results on multiple standard NER evaluations. The tokenization framework produces its tokenization, sentence boundary, and document structure annotations as native DOCREP annotations. The NER system consumes DOCREP annotations and utilises many components of the DOCREP runtime.
We believe that the adoption of DOCREP throughout the NLP community will assist in the reproducibility of results, substitutability of components, and overall quality assurance of NLP systems and corpora, all of which are problematic areas within NLP research and applications. This adoption will make developing and combining NLP components into applications faster, more efficient, and more reliable.
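To make the nesting and overlapping problem from this abstract concrete, here is a small sketch of stand-off annotation, where labels are stored as character offsets apart from the text. The `Span` class and helper functions are hypothetical illustrations and are not DOCREP's API, which the abstract does not describe.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """A stand-off annotation: a label over character offsets [start, end)."""
    label: str
    start: int
    end: int

def covered_text(text: str, span: Span) -> str:
    """Recover the annotated substring from the untouched source text."""
    return text[span.start:span.end]

def crosses(a: Span, b: Span) -> bool:
    """True if the spans share characters but neither contains the other."""
    shares = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return shares and not (a_in_b or b_in_a)

text = "New York University Press"
annotations = [
    Span("ORG", 0, 19),  # "New York University"
    Span("LOC", 0, 8),   # "New York", nested inside the first ORG
    Span("ORG", 9, 25),  # "University Press", crossing the first ORG
]
```

The first and third spans cross each other, so they cannot both be written as properly nested inline tags, yet as offset records they coexist without conflict.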


37

Miao, Yishu. "Deep generative models for natural language processing." Thesis, University of Oxford, 2017. http://ora.ox.ac.uk/objects/uuid:e4e1f1f9-e507-4754-a0ab-0246f1e1e258.


Abstract:

Deep generative models are essential to Natural Language Processing (NLP) due to their outstanding ability to use unlabelled data, to incorporate abundant linguistic features, and to learn interpretable dependencies among data. As the structure becomes deeper and more complex, having an effective and efficient inference method becomes increasingly important. In this thesis, neural variational inference is applied to carry out inference for deep generative models. While traditional variational methods derive an analytic approximation for the intractable distributions over latent variables, here we construct an inference network conditioned on the discrete text input to provide the variational distribution. The powerful neural networks are able to approximate complicated non-linear distributions and grant the possibilities for more interesting and complicated generative models. Therefore, we develop the potential of neural variational inference and apply it to a variety of models for NLP with continuous or discrete latent variables. This thesis is divided into three parts. Part I introduces a generic variational inference framework for generative and conditional models of text. For continuous or discrete latent variables, we apply a continuous reparameterisation trick or the REINFORCE algorithm to build low-variance gradient estimators. To further explore Bayesian non-parametrics in deep neural networks, we propose a family of neural networks that parameterise categorical distributions with continuous latent variables. Using the stick-breaking construction, an unbounded categorical distribution is incorporated into our deep generative models which can be optimised by stochastic gradient back-propagation with a continuous reparameterisation. Part II explores continuous latent variable models for NLP. 
Chapter 3 discusses the Neural Variational Document Model (NVDM): an unsupervised generative model of text which aims to extract a continuous semantic latent variable for each document. In Chapter 4, the neural topic models modify the neural document models by parameterising categorical distributions with continuous latent variables, where the topics are explicitly modelled by discrete latent variables. The models are further extended to neural unbounded topic models with the help of stick-breaking construction, and a truncation-free variational inference method is proposed based on a Recurrent Stick-breaking construction (RSB). Chapter 5 describes the Neural Answer Selection Model (NASM) for learning a latent stochastic attention mechanism to model the semantics of question-answer pairs and predict their relatedness. Part III discusses discrete latent variable models. Chapter 6 introduces latent sentence compression models. The Auto-encoding Sentence Compression Model (ASC), as a discrete variational auto-encoder, generates a sentence by a sequence of discrete latent variables representing explicit words. The Forced Attention Sentence Compression Model (FSC) incorporates a combined pointer network biased towards the usage of words from source sentence, which significantly improves the performance when jointly trained with the ASC model in a semi-supervised learning fashion. Chapter 7 describes the Latent Intention Dialogue Models (LIDM) that employ a discrete latent variable to learn underlying dialogue intentions. Additionally, the latent intentions can be interpreted as actions guiding the generation of machine responses, which could be further refined autonomously by reinforcement learning. Finally, Chapter 8 summarizes our findings and directions for future work.
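The stick-breaking construction mentioned in this abstract can be sketched in a few lines. The sketch assumes the standard formulation, in which break fractions v_k in [0, 1] yield weights pi_k = v_k * prod_{j<k} (1 - v_j). Drawing the fractions from a Beta distribution is an illustrative assumption; in the thesis they are produced by neural networks, which this sketch does not attempt to reproduce.

```python
import random

def stick_breaking(fractions):
    """Map break fractions v_k to weights pi_k = v_k * prod_{j<k} (1 - v_j).

    Returns the weights and the length of stick still unassigned.
    """
    weights = []
    remaining = 1.0
    for v in fractions:
        weights.append(v * remaining)  # break off a fraction v of what is left
        remaining *= 1.0 - v           # the rest of the stick carries on
    return weights, remaining

# Illustrative break fractions; the Beta(1, 2) choice is an assumption.
rng = random.Random(0)
fractions = [rng.betavariate(1.0, 2.0) for _ in range(5)]
weights, leftover = stick_breaking(fractions)
```

The weights plus the leftover always sum to one, so further categories can be added by breaking the leftover again, which is what makes the resulting categorical distribution unbounded.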


38

Hu, Jin. "Explainable Deep Learning for Natural Language Processing." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254886.


Abstract:

Deep learning methods achieve impressive performance in many Natural Language Processing (NLP) tasks, but it is still difficult to know what happens inside a deep neural network. This thesis gives a general overview of Explainable AI and of how explainable deep learning methods are applied to NLP tasks. It then introduces the bi-directional LSTM and CRF (BiLSTM-CRF) model for the Named Entity Recognition (NER) task, together with an approach to making this model explainable. The approach uses Layer-wise Relevance Propagation (LRP) to visualize the importance of neurons in the Bi-LSTM layer of the NER model; LRP can measure how neurons contribute to each prediction of a word in a sequence. Ideas about how to measure the influence of the CRF layer of the Bi-LSTM-CRF model are also described.
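As a rough illustration of how Layer-wise Relevance Propagation redistributes relevance, the sketch below applies the commonly used epsilon rule to a single bias-free linear layer. It is not the thesis's procedure: propagating through a Bi-LSTM needs additional handling of gates, and the function name, weights, and activations here are hypothetical.

```python
def lrp_epsilon(activations, weights, relevance_out, eps=1e-6):
    """Redistribute output relevance to the inputs of a linear layer (no bias).

    weights[i][j] connects input neuron i to output neuron j.
    """
    n_in, n_out = len(activations), len(relevance_out)
    # z[j] is the pre-activation of output neuron j.
    z = [sum(activations[i] * weights[i][j] for i in range(n_in)) for j in range(n_out)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # The small stabiliser keeps the division well defined when z[j] is near zero.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            # Each input receives relevance in proportion to its contribution to z[j].
            relevance_in[i] += activations[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in

# Hypothetical toy layer: two inputs, two outputs.
acts = [1.0, 2.0]
w = [[0.5, -1.0], [0.25, 1.0]]
r_out = [1.0, 0.5]
r_in = lrp_epsilon(acts, w, r_out)
```

Up to the small stabiliser, the relevance entering the layer equals the relevance leaving it, which is the conservation property that lets per-neuron scores be read as shares of a prediction.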


39

Gainon, de Forsan de Gabriac Clara. "Deep Natural Language Processing for User Representation." Electronic Thesis or Diss., Sorbonne université, 2021. http://www.theses.fr/2021SORUS274.


Abstract:

The last decade has witnessed the impressive expansion of Deep Learning (DL) methods, both in academic research and the private sector. This success can be explained by the ability of DL to model ever more complex entities. In particular, Representation Learning methods focus on building latent representations from heterogeneous data that are versatile and reusable, notably in Natural Language Processing (NLP). In parallel, the ever-growing number of systems relying on user data brings its own set of challenges. This work proposes methods to leverage the representation power of NLP in order to learn rich and versatile user representations. Firstly, we detail the works and domains associated with this thesis. We study Recommendation. We then go over recent NLP advances and how they can be applied to leverage user-generated texts, before detailing generative models. Secondly, we present a Recommender System (RS) that is based on the combination of a traditional Matrix Factorization (MF) representation method and a sentiment analysis model. The association of those modules forms a dual model that is trained on user reviews for rating prediction. Experiments show that, on top of improving performance, the model allows us to better understand what really interests the user in a given item, as well as to provide explanations for the suggestions made. Finally, we introduce a new task centered on user representation (UR): Professional Profile Learning. We thus propose an NLP-based framework to learn and evaluate professional profiles on different tasks, including next job generation.


40

Guy, Alison. "Logical expressions in natural language conditionals." Thesis, University of Sunderland, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.278644.


41

Bannour, Nesrine. "Information Extraction from Electronic Health Records : Studies on temporal ordering, privacy and environmental impact." Electronic Thesis or Diss., université Paris-Saclay, 2023. http://www.theses.fr/2023UPASG082.


Abstract:

Automatically extracting the rich information contained in Electronic Health Records (EHRs) is crucial to improve clinical research. However, most of this information is in the form of unstructured text. The complexity and the sensitive nature of clinical text involve further challenges. As a result, sharing data is difficult in practice and is strictly governed by regulations. Neural-based models showed impressive results for Information Extraction, but they need significant amounts of manually annotated data, which is often limited, particularly for non-English languages. Thus, the performance is still not ideal for practical use. In addition to privacy issues, using deep learning models has a significant environmental impact. In this thesis, we develop methods and resources for clinical Named Entity Recognition (NER) and Temporal Relation Extraction (TRE) in French clinical narratives. Specifically, we propose a privacy-preserving mimic models architecture by exploring the mimic learning approach to enable knowledge transfer from a teacher model trained on a private corpus to a student model. This student model could be publicly shared without disclosing the original sensitive data or the private teacher model on which it was trained. Our strategy offers a good compromise between performance and data privacy preservation. Then, we introduce a novel event- and task-independent representation of temporal relations. Our representation enables identifying homogeneous text portions from a temporal standpoint and classifying the relation between each text portion and the document creation time. This makes the annotation and extraction of temporal relations easier and reproducible across different event types, as no prior definition and extraction of events is required. Finally, we conduct a comparative analysis of existing tools for measuring the carbon emissions of NLP models.
We adopt one of the studied tools to calculate the carbon footprint of all the models created during the thesis, as we consider it a first step toward increasing awareness and control of their environmental impact. To summarize, we generate shareable privacy-preserving NER models that clinicians can efficiently use. We also demonstrate that the TRE task may be tackled independently of the application domain and that good results can be obtained using real-world oncology clinical notes.


42

Walker, Alden. "Natural language interaction with robots." Diss., Connect to the thesis, 2007. http://hdl.handle.net/10066/1275.


43

Fuchs, Gil Emanuel. "Practical natural language processing question answering using graphs /." Diss., Digital Dissertations Database. Restricted to UC campuses, 2004. http://uclibs.org/PID/11984.


44

Kolak, Okan. "Rapid resource transfer for multilingual natural language processing." College Park, Md. : University of Maryland, 2005. http://hdl.handle.net/1903/3182.


Abstract:

Thesis (Ph. D.) -- University of Maryland, College Park, 2005.
Thesis research directed by: Dept. of Linguistics. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.


45

Takeda, Koichi. "Building Natural Language Processing Applications Using Descriptive Models." 京都大学 (Kyoto University), 2010. http://hdl.handle.net/2433/120372.


46

Åkerud, Daniel, and Henrik Rendlo. "Natural Language Processing from a Software Engineering Perspective." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2056.


Abstract:

This thesis is intended to deal with questions related to the processing of naturally occurring texts, also known as natural language processing (NLP). The subject will be approached from a software engineering perspective, and the problem description will be formulated thereafter. The thesis is roughly divided into two major parts. The first part contains a literature study covering fundamental concepts and algorithms. We discuss both serial and parallel architectures, and conclude that different scenarios call for different architectures. The second part is an empirical evaluation of an NLP framework or toolkit chosen amongst a few, conducted in order to elucidate the theoretical part of the thesis. We argue that component based development in a portable language could increase the reusability in the NLP community, where reuse is currently low. The recent emergence of the discovered initiatives and the great potential of many applications in this area reveal a bright future for NLP.


47

Byström, Adam. "From Intent to Code : Using Natural Language Processing." Thesis, Uppsala universitet, Avdelningen för datalogi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-325238.


Abstract:

Programming and the possibility to express one’s intent to a machine is becoming a very important skill in our digitalizing society. Today, instructing a machine, such as a computer to perform actions is done through programming. What if this could be done with human language? This thesis examines how new technologies and methods in the form of Natural Language Processing can be used to make programming more accessible by translating intent expressed in natural language into code that a computer can execute. Related research has studied using natural language as a programming language and using natural language to instruct robots. These studies have shown promising results but are hindered by strict syntaxes, limited domains and inability to handle ambiguity. Studies have also been made using Natural Language Processing to analyse source code, turning code into natural language. This thesis has the reversed approach. By utilizing Natural Language Processing techniques, an intent can be translated into code containing concepts such as sequential execution, loops and conditional statements. In this study, a system for converting intent, expressed in English sentences, into code is developed. To analyse this approach to programming, an evaluation framework is developed, evaluating the system during the development process as well as usage of the final system. The results show that this way of programming might have potential but conclude that the Natural Language Processing models still have too low accuracy. Further research is required to increase this accuracy to further assess the potential of this way of programming.


48

Bigert, Johnny. "Automatic and unsupervised methods in natural language processing." Doctoral thesis, Stockholm, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-156.


49

Cohn, Trevor A. "Scaling conditional random fields for natural language processing /." Connect to thesis, 2007. http://eprints.unimelb.edu.au/archive/00002874.


50

Zhang, Lidan, and 张丽丹. "Exploiting linguistic knowledge for statistical natural language processing." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2011. http://hub.hku.hk/bib/B46506299.
