Dissertations / Theses on the topic 'Information Extraction'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 dissertations / theses for your research on the topic 'Information Extraction.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Labský, Martin. "Information Extraction from Websites using Extraction Ontologies." Doctoral thesis, Vysoká škola ekonomická v Praze, 2002. http://www.nusl.cz/ntk/nusl-77102.

Full text
Abstract:
Automatic information extraction (IE) from various types of text became very popular during the last decade. Owing to information overload, there are many practical applications that can utilize semantically labelled data extracted from textual sources like the Internet, emails, intranet documents and even conventional sources like newspapers and magazines. Applications of IE exist in many areas of computer science: information retrieval systems, question answering or website quality assessment. This work focuses on developing IE methods and tools that are particularly suited to extraction from semi-structured documents such as web pages and to situations where available training data is limited. The main contribution of this thesis is the proposed approach of extended extraction ontologies. It attempts to combine extraction evidence from three distinct sources: (1) manually specified extraction knowledge, (2) existing training data and (3) formatting regularities that are often present in online documents. The underlying hypothesis is that using all three types of extraction evidence in the extraction algorithm can help improve its extraction accuracy and robustness. The motivation for this work has been the lack of described methods and tools that would exploit these extraction evidence types at the same time. This thesis first describes a statistically trained approach to IE based on Hidden Markov Models which integrates with a picture classification algorithm in order to extract product offers from the Internet, including textual items as well as images. This approach is evaluated using a bicycle sale domain. Several methods of image classification using various feature sets are described and evaluated as well. These trained approaches are then integrated into the proposed novel approach of extended extraction ontologies, which builds on top of the work of Embley [21] by exploiting manual, trained and formatting types of extraction evidence at the same time. The intended benefit of using extraction ontologies is the quick development of a functional IE prototype, its smooth transition to a deployed IE application, and the possibility of leveraging each of the three extraction evidence types. Also, since extraction ontologies are typically developed by adapting suitable domain ontologies and the ontology remains at the center of the extraction process, the work related to the conversion of extracted results back to a domain ontology or schema is minimized. The described approach is evaluated using several distinct real-world datasets.
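As a rough illustration of combining the three evidence types described above, the following sketch scores a candidate attribute value by mixing a hand-written pattern, a stand-in for a trained model's score, and a formatting cue. All names, patterns and weights are assumptions for illustration, not the thesis implementation.

```python
import re

# Sketch of combining three evidence types for one candidate attribute value
# (e.g. a price in a bicycle offer). Patterns, weights and scores are illustrative only.

def manual_evidence(text):
    """Hand-written extraction knowledge: a simple price pattern."""
    return 1.0 if re.search(r"\d+\s?(EUR|USD|CZK)", text) else 0.0

def trained_evidence(text):
    """Stand-in for a probability produced by a model trained on labelled data (e.g. an HMM)."""
    return 0.8 if any(ch.isdigit() for ch in text) else 0.2

def formatting_evidence(is_bold, in_table_cell):
    """Formatting regularities of web pages: prices often appear bold or inside table cells."""
    return 0.5 * float(is_bold) + 0.5 * float(in_table_cell)

def combined_score(text, is_bold, in_table_cell, weights=(0.4, 0.4, 0.2)):
    w_manual, w_trained, w_format = weights
    return (w_manual * manual_evidence(text)
            + w_trained * trained_evidence(text)
            + w_format * formatting_evidence(is_bold, in_table_cell))

print(combined_score("1 299 CZK", is_bold=True, in_table_cell=True))
```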
APA, Harvard, Vancouver, ISO, and other styles
2

Arpteg, Anders. "Intelligent semi-structured information extraction : a user-driven approach to information extraction /." Linköping : Dept. of Computer and Information Science, Univ, 2005. http://www.bibl.liu.se/liupubl/disp/disp2005/tek946s.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Swampillai, Kumutha. "Information extraction across sentences." Thesis, University of Sheffield, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.575468.

Full text
Abstract:
Most relation extraction systems identify relations by searching within sentences (within-sentence relations). Such an approach excludes finding any relations that cross sentence boundaries (cross-sentence relations). This thesis quantifies the cross-sentence relations in two major information extraction corpora: ACE03 (9.4%) and MUC6 (27.4%), revealing the extent of this limitation. In response, a composite kernel approach to cross-sentence relation extraction is proposed which models relations using parse tree and flat surface features. Support vector machine classifiers are trained using cross-sentential relations from the MUC6 corpus to determine the effectiveness of this approach. It was shown that composite kernels are able to extract cross-sentential relations with f-measure scores of 0.512, 0.116 and 0.633 for the PerOrg, PerPost and PostOrg models, respectively. Moreover, combining within-sentence and cross-sentence extraction models increases the number of relations correctly identified by 24% over within-sentence relation extraction alone.
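A composite kernel of the kind described can be pictured as a weighted sum of a structural kernel and a flat surface-feature kernel fed to an SVM as a precomputed kernel. The sketch below is only a schematic illustration with toy data, not the thesis code; the toy "tree kernel" merely counts shared productions.

```python
import numpy as np
from sklearn.svm import SVC

def surface_kernel(X):
    """Linear kernel over flat surface-feature vectors (e.g. words between the two mentions)."""
    return X @ X.T

def tree_kernel(trees):
    """Toy stand-in for a parse-tree kernel: counts grammar productions shared by two parses."""
    n = len(trees)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = len(trees[i] & trees[j])
    return K

def composite_kernel(X, trees, alpha=0.5):
    """Weighted sum of the surface and structural kernels (the weight is an assumption)."""
    return alpha * surface_kernel(X) + (1 - alpha) * tree_kernel(trees)

# Toy training data: two relation candidates, each a surface-feature vector plus a set of productions.
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
trees = [{"S->NP VP", "NP->PER"}, {"S->NP VP", "NP->ORG"}]
y = [1, 0]
clf = SVC(kernel="precomputed").fit(composite_kernel(X, trees), y)
```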
APA, Harvard, Vancouver, ISO, and other styles
4

Tablan, Mihai Valentin. "Toward portable information extraction." Thesis, University of Sheffield, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.522379.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Leen, Gayle. "Context assisted information extraction." Thesis, University of the West of Scotland, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.446043.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Sottovia, Paolo. "Information Extraction from data." Doctoral thesis, Università degli studi di Trento, 2019. http://hdl.handle.net/11572/242992.

Full text
Abstract:
Data analysis is the process of inspecting, cleaning, extracting, and modeling data with the intention of extracting useful information in order to support users in their decisions. With the advent of Big Data, data analysis has become more complicated due to the volume and variety of data. This process begins with the acquisition of the data and the selection of the data that is useful for the desired analysis. With such an amount of data, even expert users are not able to inspect the data and understand whether a dataset is suitable for their purposes. In this dissertation, we focus on five problems in the broad data analysis process to help users find insights from the data when they do not have enough knowledge about the data. First, we analyze the data description problem, where the user is looking for a description of the input dataset. We introduce data descriptions: a compact, readable and insightful formula of boolean predicates that represents a set of data records. Finding the best description for a dataset is computationally expensive and task-specific; we, therefore, introduce a set of metrics and heuristics for generating meaningful descriptions with interactive performance. Secondly, we look at the problem of order dependency discovery, which uncovers another kind of metadata that may help the user understand the characteristics of a dataset. Our approach leverages the observation that discovering order dependencies can be guided by the discovery of a more specific form of dependencies called order compatibility dependencies. Thirdly, textual data encodes much hidden information. To allow this data to reach its full potential, there has been an increasing interest in extracting structural information from it. In this regard, we propose a novel approach for extracting events that is based on temporal co-reference among entities. We consider an event to be a set of entities that collectively experience relationships between them in a specific period of time. We developed a distributed strategy that is able to scale to the largest on-line encyclopedia available, Wikipedia. Then, we deal with the evolving nature of the data by focusing on the problem of finding synonymous attributes in evolving Wikipedia Infoboxes. Over time, several attributes have been used to indicate the same characteristic of an entity. This poses several issues when we try to analyze the content of different time periods. To solve this, we propose a clustering strategy that combines two contrasting distance metrics. We developed an approximate solution that we assess over 13 years of Wikipedia history, demonstrating its flexibility and accuracy. Finally, we tackle the problem of identifying movements of attributes in evolving datasets. In an evolving environment, entities not only change their characteristics, but they sometimes exchange them over time. We propose a strategy that is able to discover those cases, and we test it on real datasets. We formally present the five problems, validate them both through theoretical results and experimental evaluation, and demonstrate that the proposed approaches scale efficiently to large amounts of data.
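To make the idea of a data description concrete, the toy sketch below greedily picks attribute-value predicates that cover a target set of records. The record schema, selection criterion and greedy strategy are assumptions for illustration, not the dissertation's algorithm.

```python
# Toy greedy search for a conjunction of attribute=value predicates that
# covers a target subset of records; the metric and strategy are illustrative only.

def matching(records, predicates):
    return [r for r in records if all(r.get(a) == v for a, v in predicates)]

def describe(records, target_ids, max_predicates=3):
    target = [r for r in records if r["id"] in target_ids]
    candidates = {(a, r[a]) for r in target for a in r if a != "id"}
    description = []
    for _ in range(max_predicates):
        # Pick the predicate that keeps all target records while excluding the most others.
        best = min(candidates,
                   key=lambda p: len(matching(records, description + [p]))
                   if len(matching(target, [p])) == len(target) else float("inf"))
        if len(matching(target, [best])) != len(target):
            break
        description.append(best)
        if matching(records, description) == target:
            break
    return description

rows = [{"id": 1, "country": "IT", "year": 2019},
        {"id": 2, "country": "IT", "year": 2018},
        {"id": 3, "country": "FR", "year": 2019}]
print(describe(rows, target_ids={1, 2}))  # e.g. [("country", "IT")]
```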
APA, Harvard, Vancouver, ISO, and other styles
8

Arpteg, Anders. "Adaptive Semi-structured Information Extraction." Licentiate thesis, Linköping University, Linköping University, KPLAB - Knowledge Processing Lab, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-5688.

Full text
Abstract:

The number of domains and tasks where information extraction tools can be used needs to be increased. One way to reach this goal is to construct user-driven information extraction systems where novice users are able to adapt them to new domains and tasks. To accomplish this goal, the systems need to become more intelligent and able to learn to extract information without the need for expert skills or time-consuming work from the user.

The type of information extraction system that is in focus for this thesis is semi-structural information extraction. The term semi-structural refers to documents that not only contain natural language text but also additional structural information. The typical application is information extraction from World Wide Web hypertext documents. By making effective use of not only the link structure but also the structural information within each such document, user-driven extraction systems with high performance can be built.

The extraction process contains several steps where different types of techniques are used. Examples of such types of techniques are those that take advantage of structural, purely syntactic, linguistic, and semantic information. The first step that is in focus for this thesis is the navigation step that takes advantage of the structural information. It is only one part of a complete extraction system, but it is an important part. The use of reinforcement learning algorithms for the navigation step can make the adaptation of the system to new tasks and domains more user-driven. The advantage of using reinforcement learning techniques is that the extraction agent can efficiently learn from its own experience without the need for intensive user interaction.

An agent-oriented system was designed to evaluate the approach suggested in this thesis. Initial experiments showed that the training of the navigation step and the approach of the system were promising. However, additional components need to be included in the system before it becomes a fully-fledged user-driven system.
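As a purely illustrative sketch of the reinforcement-learning idea behind the navigation step, a tabular Q-learning agent could learn which links to follow toward pages containing target information. The environment interface, states, actions and rewards below are assumptions, not the system described in the thesis.

```python
import random
from collections import defaultdict

# Tabular Q-learning for link navigation: states are pages, actions are links to follow.
# Reward is 1 when a page containing the target information is reached (all values are toy).

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            actions = env.available_links(state)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.follow(state, action)
            best_next = max((Q[(next_state, a)] for a in env.available_links(next_state)),
                            default=0.0)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

class ToyCrawlEnv:
    """Tiny deterministic site: 'home' links to 'list' and 'about'; 'list' links to 'detail',
    which contains the target information (reward 1)."""
    links = {"home": ["list", "about"], "list": ["detail", "home"],
             "about": ["home"], "detail": ["home"]}
    def reset(self): return "home"
    def available_links(self, state): return self.links[state]
    def follow(self, state, action):
        return action, (1.0 if action == "detail" else 0.0), action == "detail"

Q = q_learning(ToyCrawlEnv())
```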


Report code: LiU-Tek-Lic-2002:73.
APA, Harvard, Vancouver, ISO, and other styles
9

Schierle, Martin. "Language Engineering for Information Extraction." Doctoral thesis, Universitätsbibliothek Leipzig, 2012. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-81757.

Full text
Abstract:
Accompanied by the cultural development to an information society and knowledge economy and driven by the rapid growth of the World Wide Web and decreasing prices for technology and disk space, the world's knowledge is evolving fast, and humans are challenged with keeping up. Despite all efforts on data structuring, a large part of this human knowledge is still hidden behind the ambiguities and fuzziness of natural language. Especially domain language poses new challenges by having specific syntax, terminology and morphology. Companies willing to exploit the information contained in such corpora are often required to build specialized systems instead of being able to rely on off-the-shelf software libraries and data resources. The engineering of language processing systems is, however, cumbersome, and the creation of language resources, the annotation of training data and the composition of modules are often more an art than a science. The scientific field of Language Engineering aims at providing reliable information, approaches and guidelines for how to design, implement, test and evaluate language processing systems. Language engineering architectures have been a subject of scientific work for the last two decades and aim at building universal systems of easily reusable components. Although current systems offer comprehensive features and rely on an architecturally sound basis, there is still little documentation about how to actually build an information extraction application. Selection of modules, methods and resources for a distinct use case requires a detailed understanding of state-of-the-art technology, application demands and characteristics of the input text. The main assumption underlying this work is the thesis that a new application can only occasionally be created by reusing standard components from different repositories. This work recapitulates existing literature about language resources, processing resources and language engineering architectures to derive a theory about how to engineer a new system for information extraction from a (domain) corpus. This thesis was initiated by the Daimler AG to prepare and analyze unstructured information as a basis for corporate quality analysis. It is therefore concerned with language engineering in the area of Information Extraction, which targets the detection and extraction of specific facts from textual data. While other work in the field of information extraction is mainly concerned with the extraction of location or person names, this work deals with automotive components, failure symptoms, corrective measures and their relations of arbitrary arity. The ideas presented in this work will be applied, evaluated and demonstrated on a real-world application dealing with quality analysis on automotive domain language. To achieve this goal, the underlying corpus is examined and scientifically characterized, algorithms are picked with respect to the derived requirements and evaluated where necessary. The system comprises language identification, tokenization, spelling correction, part-of-speech tagging, syntax parsing and a final relation extraction step. The extracted information is used as an input to data mining methods such as an early warning system and a graph-based visualization for interactive root cause analysis. It is finally investigated how the unstructured data facilitates those quality analysis methods in comparison to structured data. The acceptance of these text-based methods in the company's processes further proves the usefulness of the created information extraction system.
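The processing chain summarized above can be pictured as a simple composition of stages over a shared document structure. The sketch below is only a schematic of such a pipeline with trivial stand-in stages (and omits the parsing step); it is not the thesis system, and the stage implementations are assumptions.

```python
# Schematic of a language-engineering pipeline: each stage transforms a shared document dict.

def language_identification(doc):
    doc["lang"] = "en"  # trivial stand-in; a real component would classify the text
    return doc

def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def spelling_correction(doc):
    corrections = {"brkaes": "brakes"}  # toy domain-specific correction list
    doc["tokens"] = [corrections.get(t, t) for t in doc["tokens"]]
    return doc

def pos_tagging(doc):
    doc["pos"] = ["NOUN" if t[0].isupper() else "X" for t in doc["tokens"]]  # trivial stand-in
    return doc

def relation_extraction(doc):
    # Toy rule: a word followed by a symptom word forms a (component, symptom) relation.
    symptoms = {"squeaking", "leaking"}
    doc["relations"] = [(a, b) for a, b in zip(doc["tokens"], doc["tokens"][1:]) if b in symptoms]
    return doc

PIPELINE = [language_identification, tokenize, spelling_correction, pos_tagging, relation_extraction]

def run(text):
    doc = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

print(run("Front brkaes squeaking after rain")["relations"])  # [('brakes', 'squeaking')]
```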
APA, Harvard, Vancouver, ISO, and other styles
10

Lam, Man I. "Business information extraction from web." Thesis, University of Macau, 2008. http://umaclib3.umac.mo/record=b1937939.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Jessop, David M. "Information extraction from chemical patents." Thesis, University of Cambridge, 2011. https://www.repository.cam.ac.uk/handle/1810/238302.

Full text
Abstract:
The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature. Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use. PatentEye - an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML) - is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye - 4444 reactions are extracted with a precision of 78% and recall of 64% with regards to determining the identity and amount of reactants employed and an accuracy of 92% with regards to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information. Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication-quality before they are presented to the community.
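Hearst patterns of the kind applied here can be illustrated with a single regular expression that captures "X such as A, B and C" constructions. This is a minimal sketch of the general technique, not the thesis implementation.

```python
import re

# Minimal Hearst-pattern matcher for "HYPERNYM such as HYPONYM(, HYPONYM)* (and|or) HYPONYM".
PATTERN = re.compile(
    r"(?P<hypernym>\w+(?: \w+)?) such as "
    r"(?P<hyponyms>[\w-]+(?:, [\w-]+)*(?:,? (?:and|or) [\w-]+)?)"
)

def hearst_relations(sentence):
    relations = []
    for m in PATTERN.finditer(sentence):
        hypernym = m.group("hypernym")
        hyponyms = re.split(r", | and | or ", m.group("hyponyms"))
        relations.extend((h.strip(), hypernym) for h in hyponyms if h.strip())
    return relations

print(hearst_relations("solvents such as ethanol, acetone and dichloromethane were added"))
# [('ethanol', 'solvents'), ('acetone', 'solvents'), ('dichloromethane', 'solvents')]
```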
APA, Harvard, Vancouver, ISO, and other styles
12

Nguyen, Thien Huu. "Deep Learning for Information Extraction." Thesis, New York University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10260911.

Full text
Abstract:

The explosion of data has made it crucial to analyze the data and distill important information effectively and efficiently. A significant part of such data is presented in unstructured and free-text documents. This has prompted the development of techniques for information extraction that allow computers to automatically extract structured information from the natural free-text data. Information extraction is a branch of natural language processing in artificial intelligence that has a wide range of applications, including question answering, knowledge base population, information retrieval, etc. The traditional approach for information extraction has mainly involved hand-designing large feature sets (feature engineering) for different information extraction problems, i.e., entity mention detection, relation extraction, coreference resolution, event extraction, and entity linking. This approach is limited by the laborious and expensive effort required for feature engineering for different domains, and suffers from the unseen word/feature problem of natural languages.

This dissertation explores a different approach for information extraction that uses deep learning to automate the representation learning process and generate more effective features. Deep learning is a subfield of machine learning that uses multiple layers of connections to reveal the underlying representations of data. I develop the fundamental deep learning models for information extraction problems and demonstrate their benefits through systematic experiments.

First, I examine word embeddings, a general word representation that is produced by training a deep learning model on a large unlabelled dataset. I introduce methods to use word embeddings to obtain new features that generalize well across domains for relation extraction. This is done for both the feature-based method and the kernel-based method of relation extraction.

Second, I investigate deep learning models for different problems, including entity mention detection, relation extraction and event detection. I develop new mechanisms and network architectures that allow deep learning to model the structures of information extraction problems more effectively. Some extensive experiments are conducted on the domain adaptation and transfer learning settings to highlight the generalization advantage of the deep learning models for information extraction.

Finally, I investigate joint frameworks to simultaneously solve several information extraction problems and benefit from the inter-dependencies among these problems. I design a novel memory-augmented network for deep learning to properly exploit such inter-dependencies. I demonstrate the effectiveness of this network on two important problems of information extraction, i.e., event extraction and entity linking.
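For the first of these directions, turning pre-trained word embeddings into extra features for a feature-based relation extractor can be sketched roughly as follows; the embedding table, feature layout and classifier are toy assumptions, not the dissertation's models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy embedding table; in practice these vectors come from a model trained on large unlabelled text.
EMB = {"acquired": np.array([0.9, 0.1, 0.0]),
       "joined":   np.array([0.1, 0.8, 0.1]),
       "visited":  np.array([0.0, 0.1, 0.9])}

def features(sentence, head1, head2):
    """Concatenate the embeddings of the two entity head words and of a verb between them."""
    zero = np.zeros(3)
    known = [t for t in sentence if t in EMB]
    verb = known[0] if known else None
    return np.concatenate([EMB.get(head1, zero), EMB.get(head2, zero), EMB.get(verb, zero)])

X = np.stack([features(["OrgA", "acquired", "OrgB"], "OrgA", "OrgB"),
              features(["Ann", "joined", "OrgC"], "Ann", "OrgC")])
y = ["ACQUISITION", "EMPLOYMENT"]
clf = LogisticRegression(max_iter=200).fit(X, y)
```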

APA, Harvard, Vancouver, ISO, and other styles
13

Lee, Ji Young Ph D. Massachusetts Institute of Technology. "Information extraction with neural networks." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/111905.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 85-97).
Electronic health records (EHRs) have been widely adopted, and are a gold mine for clinical research. However, EHRs, especially their text components, remain largely unexplored due to the fact that they must be de-identified prior to any medical investigation. Existing systems for de-identification rely on manual rules or features, which are time-consuming to develop and fine-tune for new datasets. In this thesis, we propose the first de-identification system based on artificial neural networks (ANNs), which achieves state-of-the-art results without any human-engineered features. The ANN architecture is extended to incorporate features, further improving the de-identification performance. Under practical considerations, we explore transfer learning to take advantage of a large annotated dataset to improve the performance on datasets with a limited number of annotations. The ANN-based system is publicly released as an easy-to-use software package for general-purpose named-entity recognition as well as de-identification. Finally, we present an ANN architecture for relation extraction, which ranked first in the SemEval-2017 task 10 (ScienceIE) for relation extraction in scientific articles (subtask C).
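De-identification of this kind is commonly framed as token-level sequence labelling. The fragment below is a generic, minimal BiLSTM tagger sketch in PyTorch; the architecture, sizes and label set are assumptions, not the system described in the thesis.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM token tagger: embeddings -> BiLSTM -> per-token label scores."""
    def __init__(self, vocab_size, num_labels, emb_dim=50, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h)                 # (batch, seq_len, num_labels)

# Toy usage: in a de-identification setting the labels might be {O, PATIENT, DATE, ...}.
model = BiLSTMTagger(vocab_size=100, num_labels=4)
scores = model(torch.randint(0, 100, (1, 6)))
loss = nn.CrossEntropyLoss()(scores.view(-1, 4), torch.randint(0, 4, (6,)))
loss.backward()
```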
APA, Harvard, Vancouver, ISO, and other styles
14

Harik, Ralph 1979. "Structural and semantic information extraction." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/87407.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Valenzuela, Escárcega Marco Antonio. "Interpretable Models for Information Extraction." Diss., The University of Arizona, 2016. http://hdl.handle.net/10150/613348.

Full text
Abstract:
There is an abundance of information being generated constantly, most of it encoded as unstructured text. The information expressed this way, although publicly available, is not directly usable by computer systems because it is not organized according to a data model that could inform us how different data nuggets relate to each other. Information extraction provides a way of scanning unstructured text and extracting structured knowledge suitable for querying and manipulation. Most information extraction research focuses on machine learning approaches that can be considered black boxes when deployed in information extraction systems. We propose a declarative language designed for the information extraction task. It allows the use of syntactic patterns alongside token-based surface patterns that incorporate shallow linguistic features. It captures complex constructs such as nested structures, and complex regular expressions over syntactic patterns for event arguments. We implement a novel information extraction runtime system designed for the compilation and execution of the proposed language. The runtime system has novel features for better declarative support, while preserving practicality. It supports features required for handling natural language, like the preservation of ambiguity and the efficient use of contextual information. It has a modular architecture that allows it to be extended with new functionality, which, together with the language design, provides a powerful framework for the development and research of new ideas for declarative information extraction. We use our language and runtime system to build a biomedical information extraction system. This system is capable of recognizing biological entities (e.g., genes, proteins, protein families, simple chemicals), events over entities (e.g., biochemical reactions), and nested events that take other events as arguments (e.g., catalysis). Additionally, it captures complex natural language phenomena like coreference and hedging. Finally, we propose a rule learning procedure to extract rules from statistical systems trained for information extraction. Rule learning combines the advantages of machine learning with the interpretability of our models. This enables us to train information extraction systems using annotated data that can then be extended and modified by human experts, and in this way accelerate the deployment of new systems that can still be extended or modified by human experts.
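The flavour of token-based surface patterns in a declarative rule language can be conveyed with a tiny matcher over token attributes. The rule format below is a made-up illustration, not the actual language proposed in the dissertation.

```python
# Tiny matcher for declarative token patterns: each rule step constrains token attributes.
# Tokens are dicts such as {"word": "phosphorylates", "lemma": "phosphorylate", "tag": "VBZ"}.

RULE = {"label": "Phosphorylation",
        "pattern": [{"tag": "NN"},                  # a protein-like noun
                    {"lemma": "phosphorylate"},     # the trigger verb
                    {"tag": "NN"}]}                 # another protein-like noun

def match(rule, tokens):
    n = len(rule["pattern"])
    mentions = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(all(tok.get(k) == v for k, v in step.items())
               for step, tok in zip(rule["pattern"], window)):
            mentions.append((rule["label"], [t["word"] for t in window]))
    return mentions

tokens = [{"word": "KaiC", "lemma": "kaic", "tag": "NN"},
          {"word": "phosphorylates", "lemma": "phosphorylate", "tag": "VBZ"},
          {"word": "KaiB", "lemma": "kaib", "tag": "NN"}]
print(match(RULE, tokens))  # [('Phosphorylation', ['KaiC', 'phosphorylates', 'KaiB'])]
```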
APA, Harvard, Vancouver, ISO, and other styles
16

Perera, Pathirage Dinindu Sujan Udayanga. "Knowledge-driven Implicit Information Extraction." Wright State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=wright1472474558.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Batista-Navarro, Riza Theresa Bautista. "Information extraction from pharmaceutical literature." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/information-extraction-from-pharmaceutical-literature(3f8322b6-8b8d-44eb-a8cd-899026b267b9).html.

Full text
Abstract:
With the constantly growing amount of biomedical literature, methods for automatically distilling information from unstructured data, collectively known as information extraction, have become indispensable. Whilst most biomedical information extraction efforts in the last decade have focussed on the identification of gene products and interactions between them, the biomedical text mining community has recently extended their scope to capture associations between biomedical and chemical entities with the aim of supporting applications in drug discovery. This thesis is the first comprehensive study focussing on information extraction from pharmaceutical chemistry literature. In this research, we describe our work on (1) recognising names of chemical compounds and drugs, facilitated by the incorporation of domain knowledge; (2) exploring different coreference resolution paradigms in order to recognise co-referring expressions given a full-text article; and (3) defining drug-target interactions as events and distilling them from pharmaceutical chemistry literature using event extraction methods.
APA, Harvard, Vancouver, ISO, and other styles
18

Kushmerick, Nicholas. "Wrapper induction for information extraction /." Thesis, Connect to this title online; UW restricted, 1997. http://hdl.handle.net/1773/6867.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Wang, Yefeng. "Information extraction from clinical notes." Thesis, The University of Sydney, 2010. https://hdl.handle.net/2123/28844.

Full text
Abstract:
Information Extraction (IE) is an important task for Natural Language Processing (NLP). Effective IE methods, aimed at constructing structured information for unstructured natural language text, can reduce a large amount of human effort in processing the digital information available today. Successful application of IE to the clinical domain can advance clinical research and provide underlying techniques to support better health information systems. This thesis investigates the problems of IE from clinical notes.
APA, Harvard, Vancouver, ISO, and other styles
20

Spanò, Alvise <1977&gt. "Information extraction by type analysis." Doctoral thesis, Università Ca' Foscari Venezia, 2013. http://hdl.handle.net/10579/3047.

Full text
Abstract:
This thesis investigates an alternative use of type reconstruction, as a tool for extracting knowledge from programs written in weakly typed languages. We explore this avenue along two different, but related, directions. In the first part we present a static analyzer that exploits typing techniques to extract information from the COBOL source code: reconstructing informative types is an effective way for automatically generating a basic tier of documentation for legacy software, and is also a reliable starting point for performing further, higher-level program understanding processing. In the second part we apply similar principles to an apparently distant context: validating inter-component communication of Android applications by reconstructing the types of data within Intents - the building blocks of message passing in Android. Both for COBOL and Android, we present a distinct implementation of the static analysis system proposed.
APA, Harvard, Vancouver, ISO, and other styles
21

Hoang, Thi Bich Ngoc. "Information diffusion, information and knowledge extraction from social networks." Thesis, Toulouse 2, 2018. http://www.theses.fr/2018TOU20078.

Full text
Abstract:
The popularity of online social networks has rapidly increased over the last decade. According to Statista, approximately 2 billion users used social networks in January 2018 and this number is still expected to grow in the coming years. While serving their primary purpose of connecting people, social networks also play a major role in successfully connecting marketers with customers, famous people with their supporters, and people who need help with people willing to help. The success of online social networks mainly relies on the information the messages carry as well as the speed at which they spread in social networks. Our research aims at modeling the message diffusion, extracting and representing information and knowledge from messages on social networks. Our first contribution is a model to predict the diffusion of information on social networks. More precisely, we predict whether a tweet is going to be diffused or not and the level of the diffusion. Our model is based on three types of features: user-based, time-based and content-based features. Evaluated on various collections corresponding to dozens of millions of tweets, our model significantly improves the effectiveness (F-measure) compared to the state-of-the-art, both when predicting if a tweet is going to be retweeted or not, and when predicting the level of retweet. The second contribution of this thesis is to provide an approach to extract information from microblogs. While several pieces of important information are included in a message about an event, such as location, time and related entities, we focus on location, which is vital for several applications, especially geo-spatial applications and applications linked to events. We proposed different combinations of various existing methods to extract locations in tweets targeting either recall-oriented or precision-oriented applications. We also defined a model to predict whether a tweet contains a location or not. We showed that the precision of location extraction tools on the tweets we predict to contain a location is significantly improved as compared to extraction from all the tweets. Our last contribution presents a knowledge base that better represents information from a set of tweets on events. We combined a tweet collection with other Internet resources to build a domain ontology. The knowledge base aims at bringing users a complete picture of events referenced in the tweet collection (we considered the CLEF 2016 festival tweet collection).
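The three feature families (user-based, time-based and content-based) can be illustrated with a small classifier sketch; the individual features, data and model below are assumptions for illustration, not the thesis's exact feature set.

```python
from sklearn.ensemble import RandomForestClassifier

def tweet_features(tweet):
    """Toy user-, time- and content-based features for retweet prediction."""
    return [
        tweet["followers"],                 # user-based
        tweet["account_age_days"],          # user-based
        tweet["hour_posted"],               # time-based
        len(tweet["text"].split()),         # content-based: length
        int("#" in tweet["text"]),          # content-based: has hashtag
        int("http" in tweet["text"]),       # content-based: has URL
    ]

train = [({"followers": 12000, "account_age_days": 900, "hour_posted": 18,
           "text": "Big news #release http://example.org"}, 1),
         ({"followers": 40, "account_age_days": 30, "hour_posted": 3,
           "text": "good morning"}, 0)]

X = [tweet_features(t) for t, _ in train]
y = [label for _, label in train]
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.predict([tweet_features({"followers": 5000, "account_age_days": 400,
                                     "hour_posted": 12, "text": "New paper out #NLP"})]))
```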
APA, Harvard, Vancouver, ISO, and other styles
22

Toledo, Testa Juan Ignacio. "Information extraction from heterogeneous handwritten documents." Doctoral thesis, Universitat Autònoma de Barcelona, 2019. http://hdl.handle.net/10803/667388.

Full text
Abstract:
The goal of this thesis is Information Extraction from totally or partially handwritten documents. Basically, we are dealing with two different application scenarios. The first scenario is modern, highly structured documents like forms. In this kind of document, the semantic information is encoded in different fields with a pre-defined location in the document; therefore, information extraction becomes equivalent to transcription. The second application scenario is loosely structured, totally handwritten documents, where, besides transcribing them, we need to assign a semantic label, from a set of known values, to the handwritten words. In both scenarios, transcription is an important part of the information extraction. For that reason, in this thesis we present two methods based on Neural Networks to transcribe handwritten text. In order to tackle the challenge of loosely structured documents, we have produced a benchmark, consisting of a dataset, a defined set of tasks and a metric, that was presented to the community as an international competition. Also, we propose different models based on Convolutional and Recurrent neural networks that are able to transcribe and assign different semantic labels to each handwritten word, that is, able to perform Information Extraction.
APA, Harvard, Vancouver, ISO, and other styles
23

Walessa, Marc. "Bayesian information extraction from SAR images." [S.l. : s.n.], 2001. http://deposit.ddb.de/cgi-bin/dokserv?idn=964273659.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Popescu, Ana-Maria. "Information extraction from unstructured web text /." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/6935.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Williams, Dean Ashley. "Combining data integration and information extraction." Thesis, Birkbeck (University of London), 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.499152.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Duarte, Lucio Mauro. "Behaviour Model Extraction Using Context Information." Thesis, Imperial College London, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.498466.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Sukhahuta, Rattasit. "Information extraction system for Thai documents." Thesis, University of East Anglia, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.368173.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Ciravegna, Fabio. "User-defined information extraction from texts." Thesis, University of East Anglia, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.273293.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Babych, Bogdan. "Information extraction technology in machine translation." Thesis, University of Leeds, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.416402.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Collier, Robin. "Automatic template creation for information extraction." Thesis, University of Sheffield, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.286986.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

O'Malley, C. J. "Information extraction for enhanced bioprocess development." Thesis, University College London (University of London), 2008. http://discovery.ucl.ac.uk/14247/.

Full text
Abstract:
One by-product of the large-scale manufacture of biological products is the generation of significant quantities of process data. Typically this data is catalogued and stored in accordance with regulatory requirements, but rarely is it used to enhance subsequent production. A large amount of useful information is inherent in this data; the problems lie in the lack of appropriate methods to apply in order to extract it. The identification and/or development of tools capable of providing access to this valuable, untapped resource are therefore an important area for research. The main objective of this research is to investigate whether it is possible to attain knowledge from the information inherent within process data. The approach adopted in this thesis is to utilise the tools and techniques prevalent in the areas of data mining and pattern recognition. Through the application of these techniques, it is hypothesised that useful information can be acquired. Specifically the industrial sponsors of the research, Avecia Biologics, are interested in looking at methods for comparing new proteins to those they have previously worked on, with the intention of inferring information pertaining to the large scale manufacturing route for different processes. It is hypothesised that by comparing proteins and looking for similarities at the molecular level, it could be possible to identify potential pit-falls and bottlenecks in the recovery process before they occur. This would then allow Avecia to highlight areas of process development that may require specific attention. Two main techniques are the primary focus of the study; the Self-Organising Map (SOM) and the Support Vector Machine (SVM). Through a detailed investigation of these techniques, from benchmarking studies to applications with real-world problems, it is shown that these methods have the potential to become a useful tool for extracting information from biological process data.
APA, Harvard, Vancouver, ISO, and other styles
32

Forsling, Robin. "Decentralized Estimation Using Conservative Information Extraction." Licentiate thesis, Linköpings universitet, Reglerteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-171998.

Full text
Abstract:
Sensor networks consist of sensors (e.g., radar and cameras) and processing units (e.g., estimators), where in the former information extraction occurs and in the latter estimates are formed. In decentralized estimation information extracted by sensors has been pre-processed at an intermediate processing unit prior to arriving at an estimator. Pre-processing of information allows for the complexity of large systems and systems-of-systems to be significantly reduced, and also makes the sensor network robust and flexible. One of the main disadvantages of pre-processing information is that information becomes correlated. These correlations, if not handled carefully, potentially lead to underestimated uncertainties about the calculated estimates.  In conservative estimation the unknown correlations are handled by ensuring that the uncertainty about an estimate is not underestimated. If this is ensured the estimate is said to be conservative. Neglecting correlations means information is double counted which in worst case implies diverging estimates with fatal consequences. While ensuring conservative estimates is the main goal, it is desirable for a conservative estimator, as for any estimator, to provide an error covariance which is as small as possible. Application areas where conservative estimation is relevant are setups where multiple agents cooperate to accomplish a common objective, e.g., target tracking, surveillance and air policing.  The first part of this thesis deals with theoretical matters where the conservative linear unbiased estimation problem is formalized. This part proposes an extension of classical linear estimation theory to the conservative estimation problem. The conservative linear unbiased estimator (CLUE) is suggested as a robust and practical alternative for estimation problems where the correlations are unknown. Optimality criteria for the CLUE are provided and further investigated. It is shown that finding an optimal CLUE is more complicated than finding an optimal linear unbiased estimator in the classical version of the problem. To simplify the problem, a CLUE that is optimal under certain restrictions will also be investigated. The latter is named restricted best CLUE. An important result is a theorem that gives a closed form solution to a restricted best CLUE. Furthermore, several conservative estimation methods are described followed by an analysis of their properties. The methods are shown to be conservative and optimal under different assumptions about the underlying correlations.  The second part of the thesis focuses on practical aspects of the conservative approach to decentralized estimation in configurations where the communication channel is constrained. The diagonal covariance approximation is proposed as a data reduction technique that complies with the communication constraints and if handled correctly can be shown to preserve conservative estimates. Several information selection methods are derived that can reduce the amount of data being transmitted in the communication channel. Using the information selection methods it is possible to decide what information other actors of the sensor network find useful.
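A standard example of a fusion rule that remains conservative under unknown cross-correlations is covariance intersection; the sketch below illustrates that general technique and is not claimed to be the CLUE estimator developed in the thesis.

```python
import numpy as np

def covariance_intersection(x1, P1, x2, P2, steps=50):
    """Fuse two estimates with unknown cross-correlation.
    Picks the weight omega in [0, 1] minimising the trace of the fused covariance."""
    best = None
    for omega in np.linspace(0.0, 1.0, steps):
        info = omega * np.linalg.inv(P1) + (1 - omega) * np.linalg.inv(P2)
        P = np.linalg.inv(info)
        x = P @ (omega * np.linalg.inv(P1) @ x1 + (1 - omega) * np.linalg.inv(P2) @ x2)
        if best is None or np.trace(P) < best[2]:
            best = (x, P, np.trace(P))
    return best[0], best[1]

# Toy example: two estimates of a 2D state with complementary uncertainty.
x1, P1 = np.array([1.0, 0.0]), np.diag([1.0, 4.0])
x2, P2 = np.array([1.2, 0.3]), np.diag([4.0, 1.0])
x, P = covariance_intersection(x1, P1, x2, P2)
```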
APA, Harvard, Vancouver, ISO, and other styles
33

Ferreira, Liliana da Silva. "Medical information extraction in European Portuguese." Doctoral thesis, Universidade de Aveiro, 2011. http://hdl.handle.net/10773/7678.

Full text
Abstract:
Doctorate in Informatics Engineering
The electronic storage of medical patient data is becoming a daily experience in most of the practices and hospitals worldwide. However, much of the data available is in free-form text, a convenient way of expressing concepts and events, but especially challenging if one wants to perform automatic searches, summarization or statistical analysis. Information Extraction can relieve some of these problems by offering a semantically informed interpretation and abstraction of the texts. MedInX, the Medical Information eXtraction system presented in this document, is the first information extraction system developed to process textual clinical discharge records written in Portuguese. The main goal of the system is to improve access to the information locked up in unstructured text, and, consequently, the efficiency of the health care process, by allowing faster and reliable access to quality information on health, for both patient and health professionals. MedInX components are based on Natural Language Processing principles, and provide several mechanisms to read, process and utilize external resources, such as terminologies and ontologies, in the process of automatic mapping of free text reports onto a structured representation. However, the flexible and scalable architecture of the system, also allowed its application to the task of Named Entity Recognition on a shared evaluation contest focused on Portuguese general domain free-form texts. The evaluation of the system on a set of authentic hospital discharge letters indicates that the system performs with 95% F-measure, on the task of entity recognition, and 95% precision on the task of relation extraction. Example applications, demonstrating the use of MedInX capabilities in real applications in the hospital setting, are also presented in this document. These applications were designed to answer common clinical problems related with the automatic coding of diagnoses and other health-related conditions described in the documents, according to the international classification systems ICD-9-CM and ICF. The automatic review of the content and completeness of the documents is an example of another developed application, denominated MedInX Clinical Audit system.
APA, Harvard, Vancouver, ISO, and other styles
34

Roberts, Angus. "Clinical information extraction : lowering the barrier." Thesis, University of Sheffield, 2012. http://etheses.whiterose.ac.uk/3254/.

Full text
Abstract:
Electronic Patient Records have opened up the possibility of re-using the data collected for clinical practice, to support both clinical practice itself, and clinical research. In order to achieve this re-use, we have to address the issue that most Electronic Patient Records make heavy use of narrative text. This thesis reports an approach to automatically extract clinically significant information from the textual component of the medical record, in order to support re-use of that record. The cost of developing such information extraction systems is currently seen to be a barrier to their deployment. We explore ways of lowering this barrier, through the separation of the linguistic, medical and engineering knowledge and skills required for development. We describe a rigorous methodology for the construction of a corpus of clinical texts semantically annotated by medical experts, and its use to automatically train a supervised machine learning-based information extraction system. We explore the re-use of existing medical knowledge in the form of terminologies, and present a way in which these terminologies can be coupled with supervised machine learning for information extraction. Finally, we consider the extent to which pre-existing software components can be used to construct a clinical IE system, and build a system that is capable of extracting clinical concepts, their properties, and the relationships between them. The resulting system shows that it is possible to achieve separation of linguistic, medical and engineering knowledge in clinical information extraction. We find that existing software frameworks are capable of some aspects of information extraction with little additional engineering work, but that they are not mature enough for the construction of a full system by the non-expert. We also find that a new cost is introduced in separating domain and linguistic knowledge, that of manual annotation by domain experts.
APA, Harvard, Vancouver, ISO, and other styles
35

Wimalasuriya, Daya Chinthana. "Use of ontologies in information extraction." Thesis, University of Oregon, 2011. http://hdl.handle.net/1794/11216.

Full text
Abstract:
xiii, 149 p. : ill. (some col.)
Information extraction (IE) aims to recognize and retrieve certain types of information from natural language text. For instance, an information extraction system may extract key geopolitical indicators about countries from a set of web pages while ignoring other types of information. IE has existed as a research field for a few decades, and ontology-based information extraction (OBIE) has recently emerged as one of its subfields. Here, the general idea is to use ontologies--which provide formal and explicit specifications of shared conceptualizations--to guide the information extraction process. This dissertation presents two novel directions for ontology-based information extraction in which ontologies are used to improve the information extraction process. First, I describe how a component-based approach for information extraction can be designed through the use of ontologies in information extraction. A key idea in this approach is identifying components of information extraction systems which make extractions with respect to specific ontological concepts. These components are termed "information extractors". The component-based approach explores how information extractors as well as other types of components can be used in developing information extraction systems. This approach has the potential to make a significant contribution towards the widespread usage and commercialization of information extraction. Second, I describe how an ontology-based information extraction system can make use of multiple ontologies. Almost all previous systems use a single ontology, although multiple ontologies are available for most domains. Using multiple ontologies in information extraction has the potential to extract more information from text and thus leads to an improvement in performance measures. The concept of information extractor, conceived in the component-based approach for information extraction, is used in designing the principles for accommodating multiple ontologies in an ontology-based information extraction system.
Committee in charge: Dr. Dejing Dou, Chair; Dr. Arthur Farley, Member; Dr. Michal Young, Member; Dr. Monte Westerfield, Outside Member
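The notion of an "information extractor" tied to a specific ontological concept, with several such components, possibly drawn from different ontologies, assembled into one system, could be sketched along the following lines. The ontology names, concepts, patterns and example sentence are invented; real OBIE components are of course far richer than regular expressions.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# One extractor per ontological concept; a system is assembled from several,
# possibly coming from different ontologies.
@dataclass
class InformationExtractor:
    ontology: str
    concept: str
    extract: Callable[[str], List[str]]

extractors = [
    InformationExtractor("geo-ontology", "Country",
                         lambda t: re.findall(r"\b(?:France|Brazil|Japan)\b", t)),
    InformationExtractor("econ-ontology", "Currency",
                         lambda t: re.findall(r"\b(?:euro|real|yen)\b", t)),
]

text = "France adopted the euro, while Brazil kept the real."
for ex in extractors:
    print(ex.ontology, ex.concept, ex.extract(text))
```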
APA, Harvard, Vancouver, ISO, and other styles
36

Wang, Wei. "Unsupervised Information Extraction From Text - Extraction and Clustering of Relations between Entities." Phd thesis, Université Paris Sud - Paris XI, 2013. http://tel.archives-ouvertes.fr/tel-00998390.

Full text
Abstract:
Unsupervised information extraction in the open domain has recently gained importance by loosening the constraints on the strict definition of the extracted information, allowing the design of more open information extraction systems. In this new domain of unsupervised information extraction, this thesis focuses on the tasks of extraction and clustering of relations between entities at a large scale. The objective of relation extraction is to discover unknown relations from texts. A relation prototype is first defined, with which candidate relation instances are initially extracted using a minimal criterion. To guarantee the validity of the extracted relation instances, a two-step filtering procedure is applied: the first step uses filtering heuristics to efficiently remove a large amount of false relations, and the second step uses statistical models to refine the selection of relation candidates. The objective of relation clustering is to organize extracted relation instances into clusters so that their relation types can be characterized by the formed clusters and a synthetic view can be offered to end users. A multi-level clustering procedure is designed, which makes it possible to take into account massive data and diverse linguistic phenomena at the same time. First, the basic clustering groups relation instances that are similar in their linguistic expression, using only simple similarity measures on a bag-of-words representation of relation instances, to form highly homogeneous basic clusters. Second, the semantic clustering aims at grouping basic clusters whose relation instances share the same semantic meaning, dealing more particularly with phenomena such as synonymy or more complex paraphrases. Different similarity measures, based either on resources such as WordNet or on a distributional thesaurus, at the level of words, relation instances and basic clusters, are analyzed. Moreover, a topic-based relation clustering is proposed to take thematic information into account so that more precise semantic clusters can be formed. Finally, the thesis also tackles the problem of clustering evaluation in the context of unsupervised information extraction, using both internal and external measures. For the evaluations with external measures, an interactive and efficient way of building a reference set of relation clusters is proposed. The application of this method to a newspaper corpus results in a large reference, on the basis of which different clustering methods are evaluated.
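The "basic clustering" step, grouping relation instances by their linguistic expression with a simple similarity over bag-of-words representations, might look roughly like the sketch below. The relation instances and the similarity threshold are invented, and the thesis's semantic and topic-based clustering levels are not covered here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented relation instances; each is reduced to a bag of words.
instances = [
    "Google acquired YouTube",
    "Google bought YouTube",
    "Paris is the capital of France",
    "Rome is the capital of Italy",
]
X = CountVectorizer().fit_transform(instances)
sim = cosine_similarity(X)

# Greedy single-pass grouping: join the first cluster whose first member is
# similar enough, otherwise start a new basic cluster.
clusters = []
for i in range(len(instances)):
    for cluster in clusters:
        if sim[i, cluster[0]] >= 0.5:
            cluster.append(i)
            break
    else:
        clusters.append([i])

for cluster in clusters:
    print([instances[i] for i in cluster])
```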
APA, Harvard, Vancouver, ISO, and other styles
37

Català, Roig Neus. "Acquiring information extraction patterns from unannotated corpora." Doctoral thesis, Universitat Politècnica de Catalunya, 2003. http://hdl.handle.net/10803/6671.

Full text
Abstract:
Information Extraction (IE) can be defined as the task of automatically extracting pre-specified kinds of information from a text document. The extracted information is encoded in the required format and can then be used, for example, for text summarization or as an accurate index to retrieve new documents.

The main issue when building IE systems is how to obtain the knowledge needed to identify relevant information in a document. Today, IE systems are commonly based on extraction rules or IE patterns to represent the kind of information to be extracted. Most approaches to IE pattern acquisition require expert human intervention in many steps of the acquisition process. This dissertation presents a novel method for acquiring IE patterns, Essence, that significantly reduces the need for human intervention. The method is based on ELA, a specifically designed learning algorithm for acquiring IE patterns from unannotated corpora.

The distinctive features of Essence and ELA are that 1) they permit the automatic acquisition of IE patterns from unrestricted and untagged text representative of the domain, due to 2) their ability to identify regularities around semantically relevant concept-words for the IE task by 3) using non-domain-specific lexical knowledge tools such as WordNet and 4) restricting the human intervention to defining the task, and validating and typifying the set of IE patterns obtained.

Since Essence does not require a corpus annotated with the type of information to be extracted, and makes use of a general-purpose ontology and widely applied syntactic tools, it reduces the expert effort required to build an IE system and therefore also reduces the effort of porting the method to any domain.

In order to validate Essence, we conducted a set of experiments to test the performance of the method. We used Essence to generate IE patterns for a MUC-like task. Nevertheless, the evaluation procedure for MUC competitions does not provide a sound evaluation of IE systems, especially of learning systems. For this reason, we conducted an exhaustive set of experiments to further test the abilities of Essence.
The results of these experiments indicate that the proposed method is able to learn effective IE patterns.
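A very reduced picture of the underlying idea, collecting recurring contexts around semantically relevant concept words in untagged domain text as candidate extraction patterns, is sketched below. The concept words, sentences and word-window patterns are invented stand-ins; Essence and ELA rely on WordNet and syntactic analysis rather than plain word windows.

```python
from collections import Counter

# Stand-ins for one WordNet synset of semantically relevant concept words.
concept_words = {"killed", "murdered"}
corpus = [
    "the attacker killed two civilians in the raid",
    "gunmen murdered two soldiers near the border",
    "the storm destroyed several houses",
]

# Collect the immediate left/right context of each concept-word occurrence
# as a (very crude) candidate extraction pattern.
patterns = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok in concept_words:
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            patterns[(left, "<CONCEPT>", right)] += 1

print(patterns.most_common())
```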
APA, Harvard, Vancouver, ISO, and other styles
38

Carbonell, Nuñez Manuel. "Neural Information Extraction from Semi-structured Documents." Doctoral thesis, Universitat Autònoma de Barcelona, 2020. http://hdl.handle.net/10803/671583.

Full text
Abstract:
Sectors such as fintech, legaltech or insurance process an inflow of millions of forms, invoices, ID documents, claims and similar every day. The success in the automation of these transactions depends on the ability to correctly digitize the textual content as well as to incorporate semantic understanding. This procedure, known as information extraction (IE), comprises the steps of localizing and recognizing text, identifying the named entities contained in it and optionally finding relationships among its elements. In this work we explore multi-task neural models at image and graph level to solve all steps in a unified way. While doing so, we find benefits and limitations of these end-to-end approaches in comparison with sequential, separate methods.
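A minimal sketch of the multi-task idea, one shared encoder feeding separate heads for entity tagging and for scoring relations between candidate entity pairs, is given below using PyTorch. The dimensions, heads and inputs are invented, and this toy module works on token sequences rather than the document images and graphs studied in the thesis.

```python
import torch
import torch.nn as nn

class MultiTaskIE(nn.Module):
    """Shared encoder with one head per task: entity tagging and relation scoring."""
    def __init__(self, vocab_size=1000, dim=64, n_entity_tags=5, n_relations=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.entity_head = nn.Linear(2 * dim, n_entity_tags)
        self.relation_head = nn.Linear(4 * dim, n_relations)

    def forward(self, token_ids, pair_indices):
        h, _ = self.encoder(self.embed(token_ids))        # (B, T, 2*dim)
        entity_logits = self.entity_head(h)               # per-token tag scores
        heads = h[:, pair_indices[:, 0], :]                # (B, P, 2*dim)
        tails = h[:, pair_indices[:, 1], :]
        relation_logits = self.relation_head(torch.cat([heads, tails], dim=-1))
        return entity_logits, relation_logits

model = MultiTaskIE()
tokens = torch.randint(0, 1000, (1, 8))                    # one toy 8-token "document"
pairs = torch.tensor([[0, 3], [2, 5]])                      # candidate entity pairs
entity_logits, relation_logits = model(tokens, pairs)
print(entity_logits.shape, relation_logits.shape)          # (1, 8, 5) and (1, 2, 3)
```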
APA, Harvard, Vancouver, ISO, and other styles
39

Marcińczuk, Michał. "Pattern Acquisition Methods for Information Extraction Systems." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-4291.

Full text
Abstract:
This master's thesis addresses Event Recognition in the reports of Polish stockholders. Event Recognition is one of the Information Extraction tasks. The thesis provides a comparison of two approaches to Event Recognition: manual and automatic. In the manual approach, regular expressions are used; these serve as a baseline for the automatic approach. In the automatic approach, three Machine Learning methods were applied. In the initial experiment, the Decision Trees, naive Bayes and Memory Based Learning methods are compared. A modification of the standard Memory Based Learning method is presented, whose goal is to create a classifier that uses only positive examples in the classification task. The performance of the modified Memory Based Learning method is presented and compared to the baseline and also to the other Machine Learning methods. In the initial experiment, one type of annotation is used: the meeting date annotation. The final experiment is conducted using three types of annotations: the meeting time, the meeting date and the meeting place annotation. The experiments show that the classification can be performed using only one class of instances with the same level of performance.
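The described modification, a memory-based classifier that stores only positive examples and accepts a new instance when it lies close enough to a stored one, can be sketched as follows. The features, example instances and distance threshold are invented for illustration.

```python
# Toy feature dictionaries for contexts annotated as meeting dates (positives only).
positives = [
    {"prev": "on", "shape": "dd.dd.dddd", "next": "at"},
    {"prev": "dated", "shape": "dd.dd.dddd", "next": "."},
]

def overlap_distance(a: dict, b: dict) -> int:
    """Count the features on which two instances disagree."""
    return sum(1 for k in set(a) | set(b) if a.get(k) != b.get(k))

def is_meeting_date(instance: dict, max_distance: int = 1) -> bool:
    """Accept an instance if it is close enough to some stored positive example."""
    return any(overlap_distance(instance, p) <= max_distance for p in positives)

print(is_meeting_date({"prev": "on", "shape": "dd.dd.dddd", "next": "."}))   # True
print(is_meeting_date({"prev": "paid", "shape": "ddd", "next": "PLN"}))      # False
```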
APA, Harvard, Vancouver, ISO, and other styles
40

Bengtsson, Fredrik. "Algorithms for aggregate information extraction from sequences." Doctoral thesis, Luleå : Department of computer science and electrical engineering, Luleå university of technology, 2007. http://epubl.ltu.se/1402-1544/2007/25/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Janevski, Angel. "UniversityIE: Information Extraction From University Web Pages." UKnowledge, 2000. http://uknowledge.uky.edu/gradschool_theses/217.

Full text
Abstract:
The amount of information available on the web is growing constantly. As a result, the problem of retrieving any desired information is getting more difficult by the day. To alleviate this problem, several techniques are currently being used, both for locating pages of interest and for extracting meaningful information from the retrieved pages. Information extraction (IE) is one such technology that is used for summarizing unrestricted natural language text into a structured set of facts. IE is already being applied within several domains such as news transcripts, insurance information, and weather reports. Various approaches to IE have been taken and a number of significant results have been reported. In this thesis, we describe the application of IE techniques to the domain of university web pages. This domain is broader than previously evaluated domains and has a variety of idiosyncratic problems to address. We present an analysis of the domain of university web pages and the consequences of having them input to IE systems. We then present UniversityIE, a system that can search a web site, extract relevant pages, and process them for information such as admission requirements or general information. The UniversityIE system, developed as part of this research, contributes three IE methods and a web-crawling heuristic that worked relatively well and predictably over a test set of university web sites. We designed UniversityIE as a generic framework for plugging in and executing IE methods over pages acquired from the web. We also integrated in the system a generic web crawler (built at the University of Kentucky) and ported to Java and integrated an external word lexicon (WordNet) and a syntax parser (Link Grammar Parser).
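The generic "plug in and execute IE methods over crawled pages" framework could be pictured roughly as in the sketch below. The extractors, page text and regular expressions are invented examples, not the three IE methods developed in the thesis.

```python
import re
from typing import Callable, Dict, List

Extractor = Callable[[str], Dict[str, str]]

def admission_extractor(page: str) -> Dict[str, str]:
    """Invented example extractor: pull a minimum GPA requirement from a page."""
    m = re.search(r"minimum GPA of ([\d.]+)", page)
    return {"min_gpa": m.group(1)} if m else {}

def contact_extractor(page: str) -> Dict[str, str]:
    """Invented example extractor: pull a contact e-mail address."""
    m = re.search(r"[\w.]+@[\w.]+\.edu", page)
    return {"email": m.group(0)} if m else {}

class UniversityIEFramework:
    """Run every registered extractor over each crawled page and merge the results."""
    def __init__(self):
        self.extractors: List[Extractor] = []

    def register(self, extractor: Extractor) -> None:
        self.extractors.append(extractor)

    def process(self, pages: List[str]) -> List[Dict[str, str]]:
        results = []
        for page in pages:
            record: Dict[str, str] = {}
            for extract in self.extractors:
                record.update(extract(page))
            results.append(record)
        return results

framework = UniversityIEFramework()
framework.register(admission_extractor)
framework.register(contact_extractor)
print(framework.process(["Applicants need a minimum GPA of 3.0. "
                         "Contact admissions@example.edu for details."]))
```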
APA, Harvard, Vancouver, ISO, and other styles
42

Crowe, J. D. M. "Constraint based event recognition for information extraction." Thesis, University of Edinburgh, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.648968.

Full text
Abstract:
A common feature of news reports is the reference to events other than the one which is central to the discourse. Previous research has suggested Gricean explanations for this; more generally, the phenomenon has been referred to simply as "journalistic style". Whatever the underlying reasons, recent investigations into information extraction have emphasised the need for a better understanding of the mechanisms that can be used to recognise and distinguish between multiple events in discourse. Existing information extraction systems approach the problem of event recognition in a number of ways. However, although frameworks and techniques for black box evaluations of information extraction systems have been developed in recent years, almost no attention has been given to the evaluation of techniques for event recognition, despite general acknowledgement of the inadequacies of current implementations. Not only is it unclear which mechanisms are useful, but there is also little consensus as to how such mechanisms could be compared. This thesis presents a formalism for representing event structure, and introduces an evaluation metric through which a range of event recognition mechanisms are quantitatively compared. These mechanisms are implemented as modules within the CONTESS event recognition systems, and explore the use of linguistic phenomena such as temporal phrases, locative phrases and cue phrases, as well as various discourse structuring heuristics. Our results show that, whilst temporal and cue phrases are consistently useful in event recognition, locative phrases are better ignored. A number of further linguistic phenomena and heuristics are examined, providing an insight into their value for event recognition purposes.
APA, Harvard, Vancouver, ISO, and other styles
43

Sinha, Srija. "Extraction domains and information partition in Hindi." Thesis, University of York, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.274519.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Loper, Edward (Edward Daniel) 1977. "Applying semantic relation extraction to information retrieval." Thesis, Massachusetts Institute of Technology, 2000. http://hdl.handle.net/1721.1/86521.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Vlachos, Andreas. "Semi-supervised learning for biomedical information extraction." Thesis, University of Cambridge, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.608805.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

FRAZER, SCOTT RAYMOND. "INFORMATION EXTRACTION IN CHROMATOGRAPHY USING CORRELATION TECHNIQUES." Diss., The University of Arizona, 1985. http://hdl.handle.net/10150/187978.

Full text
Abstract:
While research into improving data quality from analytical instrumentation has gone on for decades, only recently has research been done to improve information extraction methods. One of these methods, correlation analysis, is based upon the shifting of one function relative to another and determining a correlation value for each displacement. The cross-correlation algorithm allows one to compare two files and find the similarities that exist; the convolution operation combines two functions two-dimensionally (e.g. any input into an analytical instrument convolves with that instrument's response to give the output); and deconvolution separates functions that have been convolved together. In correlation chromatography, multiple injections are made into a chromatograph at a rate which overlaps the instrument response to each injection. Injection intervals must be set to be as random as possible within limits set by peak widths and number. When the input pattern representation is deconvolved from the resulting output, the effect of that input is removed to give the instrument response to one injection. Since the operation averages all the information in the output, random noise is diminished and signal-to-noise ratios are enhanced. The most obvious application of correlation chromatography is in trace analysis. Signal-to-noise enhancements may be maximized by treating the output data (for example, with a baseline subtraction) before the deconvolution operation. System nonstationarities such as injector nonreproducibility and detector drift cause baseline or "correlation" noise, which limits attainable signal-to-noise enhancements to about half of what is theoretically possible. Correlation noise has been used to provide information about changes in system conditions. For example, a given concentration change that occurs over the course of a multiple injection sequence causes a reproducible correlation noise pattern; doubling the concentration change will double the amplitude of each point in the noise pattern. This correlation noise is much more amenable to computer analysis and, since it is still the result of signal averaging, the effect of random fluctuations and noise is reduced. A method for simulating conventional coupled-column separations by means of time-domain convolution of chromatograms from single-column separations is presented.
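The core deconvolution step can be illustrated with synthetic data: a pseudo-random injection sequence is convolved with an unknown single-injection response, and deconvolving the injection sequence from the noisy detector output recovers that response. All numbers below are invented, and the regularised frequency-domain deconvolution is a simplification of practical correlation chromatography.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
t = np.arange(n)

# Synthetic single-injection response (a Gaussian peak) and a pseudo-random
# multiple-injection sequence; the detector output is their circular
# convolution plus noise.
peak = np.exp(-0.5 * ((t - 40) / 6.0) ** 2)
injections = (rng.random(n) < 0.1).astype(float)
output = np.real(np.fft.ifft(np.fft.fft(injections) * np.fft.fft(peak)))
output += rng.normal(scale=0.1, size=n)

# Regularised frequency-domain deconvolution of the injection sequence from
# the output; because it pools information from all injections, random noise
# is suppressed in the recovered response.
S, Y = np.fft.fft(injections), np.fft.fft(output)
recovered = np.real(np.fft.ifft(Y * np.conj(S) / (np.abs(S) ** 2 + 10.0)))

print("true peak position:", peak.argmax(), "recovered peak position:", recovered.argmax())
```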
APA, Harvard, Vancouver, ISO, and other styles
47

Teufel, Simone. "Argumentative zoning : information extraction from scientific text." Thesis, University of Edinburgh, 1999. http://hdl.handle.net/1842/11456.

Full text
Abstract:
We present a new type of analysis for scientific text which we call Argumentative Zoning. We demonstrate that this type of text analysis can be used for generating user-tailored and task-tailored summaries or for performing more informative citation analyses. We also demonstrate that our type of analysis can be applied to unrestricted text, both automatically and by humans. The corpus we use for the analysis (80 conference papers in computational linguistics) is a difficult test bed; it shows great variation with respect to subdomain, writing style, register and linguistic expression. We present reliability studies which we performed on this corpus and for which we used two unrelated trained annotators. The definition of our seven categories (argumentative zones) is not specific to the domain, only to the text type; it is based on the typical argumentation to be found in scientific articles. It reflects the attribution of intellectual ownership in articles, expressions of the author's stance and typical statements about problem-solving processes. On the basis of sentential features, we use a Naive Bayesian model and an n-gram model over sentences to estimate a sentence's argumentative status, taking the hand-annotated corpus as training material. An alternative, symbolic system uses the features in a rule-based way. The general working hypothesis of this thesis is that empirical discourse studies can contribute to practical document management problems: the analysis of a significant amount of naturally occurring text is essential for discourse linguistic theories, and the application of a robust discourse and argumentation analysis can make text understanding techniques for practical document management more robust.
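The statistical side of Argumentative Zoning, a Naive Bayes classifier assigning an argumentative category to each sentence, can be sketched as follows. The three categories and training sentences below are invented stand-ins for the seven zones and the hand-annotated corpus used in the thesis, and plain bag-of-words features replace its richer sentential features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented sentences and a reduced zone set (the thesis uses seven categories).
train_sentences = [
    "Previous work has addressed parsing with limited success.",
    "Smith (1995) proposed a rule-based approach to this problem.",
    "We present a novel statistical method for this task.",
    "Our contribution is a new annotation scheme.",
    "However, these approaches fail on unrestricted text.",
    "Unfortunately, existing systems do not scale.",
]
train_zones = ["OTHER", "OTHER", "OWN", "OWN", "CONTRAST", "CONTRAST"]

# Bag-of-words Naive Bayes stands in for the sentential-feature model.
zoner = make_pipeline(CountVectorizer(), MultinomialNB())
zoner.fit(train_sentences, train_zones)
print(zoner.predict(["We present a new model for citation analysis."]))
```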
APA, Harvard, Vancouver, ISO, and other styles
48

Tabassum, Binte Jafar Jeniya. "Information Extraction From User Generated Noisy Texts." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1606315356821532.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Au, Kwok Chung. "Information extraction for on-line job advertisements." HKBU Institutional Repository, 2004. http://repository.hkbu.edu.hk/etd_ra/525.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Wang, Jiying. "Information extraction and integration for Web databases /." View abstract or full-text, 2004. http://library.ust.hk/cgi/db/thesis.pl?COMP%202004%20WANGJ.

Full text
Abstract:
Thesis (Ph. D.)--Hong Kong University of Science and Technology, 2004.
Includes bibliographical references (leaves 112-118). Also available in electronic version. Access restricted to campus users.
APA, Harvard, Vancouver, ISO, and other styles
