Dissertations / Theses on the topic 'Automated information extraction'


Consult the top 50 dissertations / theses for your research on the topic 'Automated information extraction.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses across a wide variety of disciplines and organise your bibliography correctly.

1

Bowden, Paul Richard. "Automated knowledge extraction from text." Thesis, Nottingham Trent University, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.298900.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Wang, Wei. "Automated spatiotemporal and semantic information extraction for hazards." Diss., University of Iowa, 2014. https://ir.uiowa.edu/etd/1415.

Full text
Abstract:
This dissertation explores three research topics related to automated spatiotemporal and semantic information extraction about hazard events from Web news reports and other social media. The dissertation makes a unique contribution by bridging geographic information science, geographic information retrieval, and natural language processing. Geographic information retrieval and natural language processing techniques are applied to extract spatiotemporal and semantic information automatically from Web documents, to retrieve information about patterns of hazard events that are not explicitly described in the texts. Chapters 2, 3 and 4 can be regarded as three standalone journal papers. The research topics covered by the three chapters are related to each other, and are presented in a sequential way. Chapter 2 begins with an investigation of methods for automatically extracting spatial and temporal information about hazards from Web news reports. A set of rules is developed to combine the spatial and temporal information contained in the reports based on how this information is presented in text in order to capture the dynamics of hazard events (e.g., changes in event locations, new events occurring) as they occur over space and time. Chapter 3 presents an approach for retrieving semantic information about hazard events using ontologies and semantic gazetteers. With this work, information on the different kinds of events (e.g., impact, response, or recovery events) can be extracted as well as information about hazard events at different levels of detail. Building on the methods presented in Chapters 2 and 3, Chapter 4 discusses an approach for automatically extracting spatial, temporal, and semantic information from tweets. Four different elements of tweets are used for assigning appropriate spatial and temporal information to hazard events in tweets. Since tweets represent shorter, but more current information about hazards and how they are impacting a local area, key information about hazards can be retrieved through extracted spatiotemporal and semantic information from tweets.
APA, Harvard, Vancouver, ISO, and other styles
3

Heckemann, Rolf Andreas. "Automated information extraction from images of the human brain." Thesis, Imperial College London, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.444549.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Malki, Khalil. "Automated Knowledge Extraction from Archival Documents." DigitalCommons@Robert W. Woodruff Library, Atlanta University Center, 2019. http://digitalcommons.auctr.edu/cauetds/204.

Full text
Abstract:
Traditional archival media such as paper, film, photographs, etc. contain a vast store of knowledge. Much of this knowledge is applicable to current business and scientific problems, and offers solutions; consequently, there is value in extracting this information. While it is possible to manually extract the content, this technique is not feasible for large knowledge repositories due to cost and time. In this thesis, we develop a system that can extract such knowledge automatically from large repositories. A Graphical User Interface that permits users to indicate the location of the knowledge components (indexes) is developed, and software features that permit automatic extraction of indexes from similar documents are presented. The indexes and the documents are stored in a persistent data store. The system is tested on a University Registrar’s legacy paper-based transcript repository. The study shows that the system provides a good solution for large-scale extraction of knowledge from archived paper and other media.
APA, Harvard, Vancouver, ISO, and other styles
5

Ortona, Stefano. "Easing information extraction on the web through automated rules discovery." Thesis, University of Oxford, 2016. https://ora.ox.ac.uk/objects/uuid:a5a7a070-338a-4afc-8be5-a38b486cf526.

Full text
Abstract:
The advent of the era of big data on the Web has made automatic web information extraction an essential tool in data acquisition processes. Unfortunately, automated solutions are in most cases more error-prone than those created by humans, resulting in dirty and erroneous data. Automatic repair and cleaning of the extracted data is thus a necessary complement to information extraction on the Web. This thesis investigates the problem of inducing cleaning rules on web extracted data in order to (i) repair and align the data w.r.t. an original target schema, (ii) produce repairs that are as generic as possible such that different instances can benefit from them. The problem is addressed from three different angles: replace cross-site redundancy with an ensemble of entity recognisers; produce general repairs that can be encoded in the extraction process; and exploit entity-wide relations to infer common knowledge on extracted data. First, we present ROSeAnn, an unsupervised approach to integrate semantic annotators and produce a unified and consistent annotation layer on top of them. Both the diversity in vocabulary and widely varying accuracy justify the need for middleware that reconciles different annotator opinions. Considering annotators as "black-boxes" that do not require per-domain supervision allows us to recognise semantically related content in web extracted data in a scalable way. Second, we show in WADaR how annotators can be used to discover rules to repair web extracted data. We study the problem of computing joint repairs for web data extraction programs and their extracted data, providing an approximate solution that requires no per-source supervision and proves effective across a wide variety of domains and sources. The proposed solution is effective not only in repairing the extracted data, but also in encoding such repairs in the original extraction process. Third, we investigate how relationships among entities can be exploited to discover inconsistencies and additional information. We present RuDiK, a disk-based scalable solution to discover first-order logic rules over RDF knowledge bases built from web sources. We present an approach that does not limit its search space to rules that rely on "positive" relationships between entities, as is the case with traditional mining of constraints. On the contrary, it extends the search space to also discover negative rules, i.e., patterns that lead to contradictions in the data.
APA, Harvard, Vancouver, ISO, and other styles
6

Ademi, Muhamet. "adXtractor – Automated and Adaptive Generation of Wrappers for Information Retrieval." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20071.

Full text
Abstract:
The aim of this project is to investigate the feasibility of retrieving unstructured automotive listings from structured web pages on the Internet. The research has two major purposes: (1) to investigate whether it is feasible to pair information extraction algorithms and compute wrappers, and (2) to demonstrate the results of pairing these techniques and evaluate the measurements. We merge two training sets available on the web to construct reference sets, which are the basis for the information extraction. The wrappers are computed by using information extraction techniques to identify data properties with a variety of techniques such as fuzzy string matching, regular expressions and document tree analysis. The results demonstrate that it is possible to pair these techniques successfully and retrieve the majority of the listings. Additionally, the findings also suggest that many platforms utilise lazy loading to populate image resources, which the algorithm is unable to capture. In conclusion, the study demonstrated that it is possible to use information extraction to compute wrappers dynamically by identifying data properties. Furthermore, the study demonstrates the ability to open non-queryable domain data through a unified service.
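As an illustration of the reference-set matching mentioned above, the following minimal Python sketch scores a scraped candidate string against a small, made-up reference set using fuzzy string similarity (Python's standard difflib). The reference entries, the threshold and the library choice are assumptions of this sketch, not details taken from the thesis.

```python
from difflib import SequenceMatcher

# Hypothetical reference set of known makes/models; the thesis builds such
# sets by merging training data available on the web.
REFERENCE_SET = ["Volvo V70", "Saab 9-3", "Audi A4", "BMW 320i"]

def best_match(candidate: str, references=REFERENCE_SET, threshold=0.5):
    """Return the reference entry most similar to `candidate`,
    or None if no entry clears the (assumed) similarity threshold."""
    scored = [(SequenceMatcher(None, candidate.lower(), ref.lower()).ratio(), ref)
              for ref in references]
    score, ref = max(scored)
    return ref if score >= threshold else None

if __name__ == "__main__":
    print(best_match("volvo v70 momentum"))  # -> "Volvo V70"
```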
APA, Harvard, Vancouver, ISO, and other styles
7

Xhemali, Daniela. "Automated retrieval and extraction of training course information from unstructured web pages." Thesis, Loughborough University, 2010. https://dspace.lboro.ac.uk/2134/7022.

Full text
Abstract:
Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages and there is much need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. There are some commercial, automated web extraction software packages; however, their success comes from heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages and manually dealing with the evaluation and storage of the extracted results. This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; after much consideration, Naïve Bayes Networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) for the generation of web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web, such as course names, prices, dates and locations. The experimental results indicate that all three aspects of this research perform very well, with the Web Crawler outperforming existing crawling systems, the Web Classifier performing with an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieving an accuracy of over 94% for the extraction of course titles and an accuracy of just under 67% for the extraction of other course attributes such as dates, prices and locations. Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, often no, human assistance.
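To make the regex-evolution idea concrete, the sketch below shows only the fitness-evaluation step that a GP system of this kind needs: a candidate regular expression is scored against a few invented labelled snippets with an F1-style measure. The example data, the price-only focus and the scoring scheme are assumptions for illustration, not the thesis's actual fitness function.

```python
import re

# Hypothetical labelled snippets: (page text, expected price string or None).
EXAMPLES = [
    ("Course fee: £350 + VAT, starts 12 May", "£350"),
    ("Price £1,200 per delegate", "£1,200"),
    ("Contact us for pricing", None),
]

def fitness(pattern: str, examples=EXAMPLES) -> float:
    """F1-style fitness of a candidate price-extraction regex."""
    tp = fp = fn = 0
    rx = re.compile(pattern)
    for text, expected in examples:
        m = rx.search(text)
        found = m.group(0) if m else None
        if expected is None:
            fp += 1 if found else 0
        elif found == expected:
            tp += 1
        else:
            fn += 1
            fp += 1 if found else 0
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(fitness(r"£[\d,]+"))   # a strong candidate scores 1.0 on these toy examples
print(fitness(r"\d+"))       # a weak candidate is penalised
```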
APA, Harvard, Vancouver, ISO, and other styles
8

Hedbrant, Per. "Towards a fully automated extraction and interpretation of tabular data using machine learning." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-391490.

Full text
Abstract:
Motivation: A challenge for researchers at CBCS is the ability to efficiently manage data formats that change frequently. This handling includes importing data into the same format, regardless of the output of the various instruments used. There are commercial solutions available for this process, but to our knowledge, all of these require prior generation of templates to which the data must conform. A significant amount of time is spent on manual pre-processing, converting from one format to another. There are currently no solutions that use pattern recognition to locate and automatically recognise data structures in a spreadsheet.
Problem definition: The desired solution is to build a self-learning Software-as-a-Service (SaaS) for automated recognition and loading of data stored in arbitrary formats. The aim of this study is threefold: A) investigate whether unsupervised machine learning methods can be used to label different types of cells in spreadsheets; B) investigate whether a hypothesis-generating algorithm can be used to label different types of cells in spreadsheets; C) advise on choices of architecture and technologies for the SaaS solution.
Method: A pre-processing framework is built that can read and pre-process any type of spreadsheet into a feature matrix. Different datasets are read and clustered, and the usefulness of reducing the dimensionality is investigated. A hypothesis-driven algorithm is built and adapted to two of the data formats CBCS uses most frequently. Discussions are held on choices of architecture and technologies for the SaaS solution, including system design patterns, web development framework and database.
Result: The reading and pre-processing framework is in itself a valuable result, due to its general applicability. No satisfying results are found when using the mini-batch K-means clustering method. When reading data from only one format, the dimensionality can be reduced from 542 to around 40 dimensions. The hypothesis-driven algorithm can consistently interpret the format it is designed for, but more work is needed to make it more general.
Implication: The study contributes to the desired solution in the short term through the hypothesis-generating algorithm, and in a more generalisable way through the unsupervised learning approach. It also contributes by initiating a conversation around the system design choices.
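A minimal sketch of the clustering step described above, assuming scikit-learn's MiniBatchKMeans and a toy cell-feature matrix in place of the real spreadsheet features; the feature names, sizes and cluster count are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans

# Toy cell-feature matrix: one row per spreadsheet cell, columns are
# hypothetical features (is_numeric, text_length, row_index, col_index).
# The thesis works with a much larger matrix (hundreds of dimensions).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([1.0, 3.0, 10.0, 2.0], 0.5, size=(50, 4)),   # data-like cells
    rng.normal([0.0, 12.0, 1.0, 2.0], 0.5, size=(10, 4)),   # header-like cells
])

X_scaled = StandardScaler().fit_transform(X)
labels = MiniBatchKMeans(n_clusters=2, batch_size=32, n_init=10,
                         random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # roughly a 50/10 split if the clusters separate
```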
APA, Harvard, Vancouver, ISO, and other styles
9

Sahar, Liora. "Using remote-sensing and gis technology for automated building extraction." Diss., Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/37231.

Full text
Abstract:
Extraction of buildings from remote sensing sources is an important GIS application and has been the subject of extensive research over the last three decades. An accurate building inventory is required for applications such as GIS database maintenance and revision; impervious surfaces mapping; storm water management; hazard mitigation and risk assessment. Despite all the progress within the fields of photogrammetry and image processing, the problem of automated feature extraction is still unresolved. A methodology for automatic building extraction that integrates remote sensing sources and GIS data was proposed. The methodology consists of a series of image processing and spatial analysis techniques. It incorporates an initial simplification procedure and multiple feature analysis components. The extraction process was implemented and tested on three distinct types of buildings: commercial, residential and high-rise. Aerial imagery and GIS data from Shelby County, Tennessee, were identified for the testing and validation of the results. The contribution of each component to the overall methodology was quantitatively evaluated as it relates to each type of building. The automatic process was compared to manual building extraction and provided a means to reduce the effort of the manual procedure. A separate module was implemented to identify the 2D shape of a building. Indices for two specific shapes were developed based on moment theory. The indices were tested and evaluated on multiple feature segments and proved to be successful. The research identifies the successful building extraction scenarios as well as the challenges, difficulties and drawbacks of the process. Recommendations are provided based on the testing and evaluation for future extraction projects.
APA, Harvard, Vancouver, ISO, and other styles
10

Nepal, Madhav Prasad. "Automated extraction and querying of construction-specific design features from a building information model." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/38046.

Full text
Abstract:
In recent years, several research and industry efforts have focused on developing building information models (BIMs) to support various aspects of the architectural, engineering, construction and facility management (AEC/FM) industry. BIMs provide semantically-rich information models that explicitly represent both 3D geometric and non-geometric information. While BIMs have many useful applications to the construction industry, there are enormous challenges in getting construction-specific information out of BIMs, limiting the usability of these models. This research addresses this problem by developing a novel approach to extract construction features from a given BIM and support the processing of user-driven queries on a BIM. In this dissertation, we formalized: (i) An ontology of design features that explicitly represents design conditions that are relevant to construction practitioners and supports the generation of a construction-specific feature-based model; (ii) A query specification vocabulary which characterizes spatial and non-spatial queries, and developed query templates to guide non-expert BIM users to specify queries; and (iii) An integrated approach that combines model-based reasoning and query-based approach to automatically extract design features to create a project-specific feature-based model (FBM) and provide support for answering queries on the FBM. The construction knowledge formalized in this research was gathered from a variety of sources, which included a detailed literature review, several case studies, extensive observations of design and construction meetings, and lengthy discussions with different construction practitioners. We used three different tests to validate the research contributions. We conducted semi-structured, informal interviews with four construction experts for the four building projects studied to validate the content, representativeness and the generality of the concepts formalized in this research. We conducted retrospective analysis for different features to evaluate the soundness of our research in comparison with the state-of-the-art tools. Finally, we performed descriptive and interpretive analysis to demonstrate that our approach is capable of providing richer, insightful and useful construction information. This research can help to make a BIM more accessible for construction users. The developed solutions can support decision making in a variety of construction management functions, such as cost estimating, construction planning, execution and coordination, purchasing, constructability analysis, methods selection, and productivity analysis.
APA, Harvard, Vancouver, ISO, and other styles
11

Slabber, Frans Bresler. "Semi-automated extraction of structural orientation data from aerospace imagery combined with digital elevation models." Thesis, Rhodes University, 1996. http://hdl.handle.net/10962/d1005614.

Full text
Abstract:
A computer-based method for determining the orientation of planar geological structures from remotely sensed images, utilizing digital geological images and digital elevation models (DEMs), is developed and assessed. The method relies on operator skill and experience to recognize geological structure traces on images, and then employs software routines (GEOSTRUC©) to calculate the orientation of selected structures. The operator selects three points on the trace of a planar geological feature as seen on a digital geological image that is co-registered with a DEM of the same area. The orientation of the plane that contains the three points is determined using vector algebra equations. The program generates an ASCII data file which contains the orientation data as well as the geographical location of the measurements. This ASCII file can then be utilized in further analysis of the orientation data. The software development kit (SDK) for TNTmips v5.00, from MicroImages Inc. and operating in the X Windows environment, was employed to construct the software. The Watcom C/C++ Development Environment was used to generate the executable program, GEOSTRUC©. GEOSTRUC© was tested in two case studies. The case studies utilized digital data derived from the use of different techniques and from different sources which varied in scale and resolution. This was done to illustrate the versatility of the program and its application to a wide range of data types. On the whole, the results obtained using the GEOSTRUC© analyses compare favourably to field data from each test area. Use of the method to determine the orientation of axial planes in the case study revealed the usefulness of the method as a powerful analytic tool for use on a macroscopic scale. The method should not be applied in areas with low variation in relief, as the method proved to be less accurate in these areas. Advancements in imaging technology will serve to create images with better resolution, which will, in turn, improve the overall accuracy of the method.
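The vector-algebra step described above (deriving a plane's orientation from three points read off a co-registered image and DEM) can be sketched as follows. This is a generic illustration, not GEOSTRUC© itself; the east-north-up coordinate axes and the dip/dip-direction output convention are assumptions of the sketch.

```python
import numpy as np

def plane_orientation(p1, p2, p3):
    """Dip direction and dip (degrees) of the plane through three points.

    Points are (east, north, up) coordinates, e.g. taken from a DEM at three
    locations picked on a structure trace.  Dip direction is measured
    clockwise from north; these conventions are assumptions for illustration.
    """
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    n = np.cross(p2 - p1, p3 - p1)          # plane normal
    if n[2] < 0:                            # make the normal point upward
        n = -n
    dip = np.degrees(np.arccos(n[2] / np.linalg.norm(n)))
    dip_dir = np.degrees(np.arctan2(n[0], n[1])) % 360.0  # azimuth of steepest descent
    return dip_dir, dip

# Three points on a bedding trace: the plane dips ~27 degrees toward north.
print(plane_orientation((0, 0, 100), (100, 0, 100), (0, 100, 50)))
```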
APA, Harvard, Vancouver, ISO, and other styles
12

Kim, Kee-Tae. "Satellite mapping and automated feature extraction: geographic information system-based change detection of the Antarctic coast." The Ohio State University, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=osu1072898409.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Kim, Kee-Tae. "Satellite mapping and automated feature extraction: geographic information system-based change detection of the Antarctic coast." Connect to this title online, 2003. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1072898409.

Full text
Abstract:
Thesis (Ph. D.)--Ohio State University, 2003.
Title from first page of PDF file. Document formatted into pages; contains xiv, 157 p.; also includes graphics. Includes bibliographical references (p. 143-148).
APA, Harvard, Vancouver, ISO, and other styles
14

Cleve, Oscar, and Sara Gustafsson. "Automatic Feature Extraction for Human Activity Recognition on the Edge." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-260247.

Full text
Abstract:
This thesis evaluates two methods for automatic feature extraction to classify the accelerometer data of periodic and sporadic human activities. The first method selects features using individual hypothesis tests and the second uses a random forest classifier as an embedded feature selector. The hypothesis test was combined with a correlation filter in this study. Both methods used the same initial pool of automatically generated time series features. A decision tree classifier was used to perform the human activity recognition task for both methods. The possibility of running the developed model on a processor with limited computing power was taken into consideration when selecting methods for evaluation. The classification results showed that the random forest method was good at prioritizing among features. With 23 features selected, it had a macro average F1 score of 0.84 and a weighted average F1 score of 0.93. The first method, however, only had a macro average F1 score of 0.40 and a weighted average F1 score of 0.63 when using the same number of features. In addition to the classification performance, this thesis studies the potential business benefits that automation of feature extraction can result in.
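A minimal sketch of the second method described above (random forest importances used as an embedded selector, followed by a decision tree and macro/weighted F1 scoring), using synthetic data in place of the accelerometer features; the feature counts and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for the automatically generated time-series features.
X, y = make_classification(n_samples=1500, n_features=120, n_informative=25,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Embedded selection: rank features by random-forest importance, keep the top 23.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:23]

# Final, lightweight classifier on the selected features only.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr[:, top], y_tr)
pred = tree.predict(X_te[:, top])
print("macro F1   :", round(f1_score(y_te, pred, average="macro"), 2))
print("weighted F1:", round(f1_score(y_te, pred, average="weighted"), 2))
```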
APA, Harvard, Vancouver, ISO, and other styles
15

Li, Yang. "Automated extraction of feature and variability information from natural language requirement specifications." Examiners: Gunter Saake and Andreas Nürnberger. Magdeburg: Universitätsbibliothek Otto-von-Guericke-Universität, 2020. http://d-nb.info/1226932002/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Tate, Calandra Rilette. "An investigation of the relationship between automated machine translation evaluation metrics and user performance on an information extraction task." College Park, Md.: University of Maryland, 2007. http://hdl.handle.net/1903/7777.

Full text
Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2007.
Thesis research directed by: Applied Mathematics Program. Title from title page of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
APA, Harvard, Vancouver, ISO, and other styles
17

Mao, Jin, Lisa R. Moore, Carrine E. Blank, Elvis Hsin-Hui Wu, Marcia Ackerman, Sonali Ranade, and Hong Cui. "Microbial phenomics information extractor (MicroPIE): a natural language processing tool for the automated acquisition of prokaryotic phenotypic characters from text sources." BIOMED CENTRAL LTD, 2016. http://hdl.handle.net/10150/622562.

Full text
Abstract:
Background: The large-scale analysis of phenomic data (i.e., full phenotypic traits of an organism, such as shape, metabolic substrates, and growth conditions) in microbial bioinformatics has been hampered by the lack of tools to rapidly and accurately extract phenotypic data from existing legacy text in the field of microbiology. To quickly obtain knowledge on the distribution and evolution of microbial traits, an information extraction system needed to be developed to extract phenotypic characters from large numbers of taxonomic descriptions so they can be used as input to existing phylogenetic analysis software packages.
Results: We report the development and evaluation of Microbial Phenomics Information Extractor (MicroPIE, version 0.1.0). MicroPIE is a natural language processing application that uses a robust supervised classification algorithm (Support Vector Machine) to identify characters from sentences in prokaryotic taxonomic descriptions, followed by a combination of algorithms applying linguistic rules with groups of known terms to extract characters as well as character states. The input to MicroPIE is a set of taxonomic descriptions (clean text). The output is a taxon-by-character matrix, with taxa in the rows and a set of 42 pre-defined characters (e.g., optimum growth temperature) in the columns. The performance of MicroPIE was evaluated against a gold standard matrix and another student-made matrix. Results show that, compared to the gold standard, MicroPIE extracted 21 characters (50%) with a Relaxed F1 score > 0.80 and 16 characters (38%) with Relaxed F1 scores ranging between 0.50 and 0.80. Inclusion of a character prediction component (SVM) improved the overall performance of MicroPIE, notably the precision. Evaluated against the same gold standard, MicroPIE performed significantly better than the undergraduate students.
Conclusion: MicroPIE is a promising new tool for the rapid and efficient extraction of phenotypic character information from prokaryotic taxonomic descriptions. However, further development, including incorporation of ontologies, will be necessary to improve the performance of the extraction for some character types.
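The supervised character-prediction step can be illustrated with a small sketch: a TF-IDF plus linear SVM pipeline trained on a handful of invented sentences labelled with two of the character types. This is not MicroPIE's actual pipeline or training data, only an indication of how an SVM sentence classifier of this kind is set up.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training sentences labelled with the character they describe
# (MicroPIE targets 42 pre-defined characters; only two are shown here).
sentences = [
    ("Optimal growth occurs at 37 degrees C.",        "optimum growth temperature"),
    ("Growth is best between 25 and 30 degrees C.",   "optimum growth temperature"),
    ("Cells are rod-shaped, 0.5-1.0 um wide.",        "cell shape"),
    ("Coccoid cells occur singly or in pairs.",       "cell shape"),
]
texts, labels = zip(*sentences)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["Grows optimally at 28 degrees C."]))  # -> ['optimum growth temperature']
```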
APA, Harvard, Vancouver, ISO, and other styles
18

Deshpande, Sagar Shriram. "Semi-automated Methods to Create a Hydro-flattened DEM using Single Photon and Linear Mode LiDAR Points." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1491300120665946.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Wächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-64838.

Full text
Abstract:
Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing.
Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods.
Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high-quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child-ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because they contained a specific term or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.
APA, Harvard, Vancouver, ISO, and other styles
20

Munnecom, Lorenna, and Miguel Chaves de Lemos Pacheco. "Exploration of an Automated Motivation Letter Scoring System to Emulate Human Judgement." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34563.

Full text
Abstract:
As the popularity of the master's in data science at Dalarna University increases, so does the number of applicants. The aim of this thesis was to explore different approaches to provide an automated motivation letter scoring system which could emulate human judgement and automate the process of candidate selection. Several steps such as image processing and text processing were required to enable the authors to retrieve numerous features which could lead to the identification of the factors graded by the program managers. Grammar-based features and advanced textual features were extracted from the motivation letters, followed by the application of Topic Modelling methods to extract the probability of each topic occurring within a motivation letter. Furthermore, correlation analysis was applied to quantify the association between the features and the different factors graded by the program managers, followed by Ordinal Logistic Regression and Random Forest to build models with the most impactful variables. Finally, the Naïve Bayes Algorithm, Random Forest and Support Vector Machine were used, first for classification and then for prediction purposes. The results were not promising, as the factors were not accurately identified. Nevertheless, the authors suspected that the factors may be strongly related to the prominence of specific topics within a motivation letter, which could lead to further research.
APA, Harvard, Vancouver, ISO, and other styles
21

Collier, Robin. "Automatic template creation for information extraction." Thesis, University of Sheffield, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.286986.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Joseph, Daniel. "Linking information resources with automatic semantic extraction." Thesis, University of Manchester, 2016. https://www.research.manchester.ac.uk/portal/en/theses/linking-information-resources-with-automatic-semantic-extraction(ada2db36-4366-441a-a0a9-d76324a77e2c).html.

Full text
Abstract:
Knowledge is a critical dimension in the problem-solving processes of human intelligence. Consequently, enabling intelligent systems to provide advanced services requires that their artificial intelligence routines have access to knowledge of relevant domains. Ontologies are often utilised as the formal conceptualisation of domains, in that they identify and model the concepts and relationships of the targeted domain. However, complexities inherent in ontology development and maintenance have limited their availability. Separate from the conceptualisation component, domain knowledge also encompasses the concept membership of object instances within the domain. The need to capture both the domain model and the current state of instances within the domain has motivated the import of Formal Concept Analysis into intelligent systems research. Formal Concept Analysis, which provides a simplified model of a domain, has the advantage in that not only does it define concepts in terms of their attribute description but object instances are simultaneously ascribed to their appropriate concepts. Nonetheless, a significant drawback of Formal Concept Analysis is that when applied to a large dataset, the lattice with which it models a domain is often composed of a copious number of concepts, many of which are arguably unnecessary or invalid. In this research, a novel measure is introduced which assigns a relevance value to concepts in the lattice. This measure is termed the Collapse Index and is based on the minimum number of object instances that need be removed from a domain in order for a concept to be expunged from the lattice. Mathematics that underpin its origin and behaviour are detailed in the thesis, showing that if the relevance of a concept is defined by the Collapse Index: a concept will eventually lose relevance if one of its immediate subconcepts increasingly acquires object instance support; and a concept has its highest relevance when its immediate subconcepts have equal or near equal object instance support. In addition, experimental evaluation is provided where the Collapse Index demonstrated comparable or better performance than the current prominent alternatives in: being consistent across samples; the ability to recall concepts in noisy lattices; and efficiency of calculation. It is also demonstrated that the Collapse Index affords concepts with low object instance support the opportunity to have a higher relevance than those of high support. The second contribution to knowledge is that of an approach to semantic extraction from a dataset where the Collapse Index is included as a method of selecting concepts for inclusion in a final concept hierarchy. The utility of the approach is demonstrated by reviewing its inclusion in the implementation of a recommender system. This recommender system serves as the final contribution, featuring a unique design where lattices represent user profiles and concepts in these profiles are pruned using the Collapse Index. Results showed that pruning of profile lattices enabled by the Collapse Index improved the success levels of movie recommendations if the appropriate thresholds are set.
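The lattice machinery this abstract builds on can be made concrete with a minimal Formal Concept Analysis sketch: all formal concepts of a tiny, invented object-attribute context are enumerated with the two derivation operators. The Collapse Index itself is the thesis's contribution and is not reproduced here; the sketch only shows the concept structure it would be computed over.

```python
from itertools import combinations

# Toy formal context: objects (films) x attributes (genres).
context = {
    "Alien":    {"sci-fi", "horror"},
    "Arrival":  {"sci-fi", "drama"},
    "The Ring": {"horror"},
    "Her":      {"sci-fi", "drama"},
}

def common_attributes(objs):
    """Derivation: attributes shared by every object in `objs`."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else set.union(*context.values())

def objects_having(attrs):
    """Derivation: objects that have every attribute in `attrs`."""
    return {o for o, a in context.items() if attrs <= a}

# A formal concept is a pair (extent, intent) closed under both derivations.
concepts = set()
objects = list(context)
for r in range(len(objects) + 1):
    for combo in combinations(objects, r):
        intent = common_attributes(combo)
        extent = objects_having(intent)
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```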
APA, Harvard, Vancouver, ISO, and other styles
23

Jimeno, Yepes Antonio José. "Ontology refinement for improved information retrieval in the biomedical domain." Doctoral thesis, Universitat Jaume I, 2009. http://hdl.handle.net/10803/384552.

Full text
Abstract:
This doctoral thesis focuses on the use of domain ontologies and their refinement for information retrieval. The selected domain is Biomedicine, which has an extensive collection of abstracts in the Medline database and resources that facilitate the creation of very large ontologies, such as MeSH or UMLS. The work also develops a query formulation model that relates a document model to an ontology within the language modelling framework. In addition, we have developed an algorithm that improves the ontology for the information retrieval task from unstructured resources. The results show that ontology refinement applied to information retrieval improves performance by automatically identifying information not present in the ontology. We have also found that the type of content relevant to a query depends on properties related to the query type and the document collection. The results are consistent with existing results in the field of information retrieval.
APA, Harvard, Vancouver, ISO, and other styles
24

Barry, Ousmane. "Semi-Automatic Extraction of Information from Satellite Images." Thesis, KTH, Ljud- och bildbehandling, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-55351.

Full text
Abstract:
This master thesis project deals with the semi-automatic extraction of information from satellite images. Some geographic information systems (GIS) are dedicated to the issue of data production. The graphical user interface of these GIS is essentially passive, and only provides basic CAD tools for intelligence information mapping, such as geometric and semantic capture of spatial objects and semantic improvement of geographic objects. Like CAD software, they improve operator productivity only within the limits of their ergonomics. Thus, by combining several generic image processing algorithms, we have implemented a component for semi-automatic extraction of features from satellite images. We gave priority to the interaction between the user and the component: the user focuses only on the interpretation of the images, while the component performs the repetitive tasks for him. The addressed features were suburban roads, hydrographic area boundaries and shorelines. This system, based on powerful tools such as the Orfeo Toolbox (the core) and Qt (the GUI), has been tested on images from different satellites and the results are quite satisfactory. This opens perspectives for improving and optimizing the system with the aim of integrating it into a GIS solution.
APA, Harvard, Vancouver, ISO, and other styles
25

del, Aguila Pla Pol. "Normalization of Remote Sensing Imagery for Automatic Information Extraction." Thesis, KTH, Kommunikationsteori, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-144032.

Full text
Abstract:
For the time being, Remote Sensing automatized techniques are conventionally designed to be used exclusively on data captured by a particular sensor system. This convention was only adopted after evidence suggested that, in the field, algorithms that yield great results on data from one specific satellite or sensor tend to underachieve on data from similar sensors. With this effect in mind, we will refer to remote sensing imagery as heterogeneous. There have been attempts to compensate for every effect on the data and obtain the underlying physical property that carries the information, the ground reflectance. Because of their improvement of the informative value of each image, some of them have even been standardized as common preprocessing methods. However, these techniques generally require further knowledge of certain atmospheric properties at the time the data was captured. This information is generally not available and has to be estimated or guessed by experts, a very time-consuming, inaccurate and expensive task. Moreover, even if the results do improve in each of the treated images, a significant decrease of their heterogeneity is not achieved. There have been more automatized proposals to treat the data in the literature, which have been broadly named RRN (Relative Radiometric Normalization) algorithms. These consider the problem of heterogeneity itself and use properties strictly related to the statistics of remote sensing imagery to solve it. In this master thesis, an automatic algorithm to reduce heterogeneity in generic imagery is designed, characterized and evaluated through crossed classification results on remote sensing imagery.
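One common, simple form of relative radiometric normalization (a linear gain/offset that matches a subject band's mean and standard deviation to a reference band) can be sketched as below. This is a generic baseline for illustration, not the algorithm designed in the thesis; many RRN methods refine it by estimating the mapping only from pseudo-invariant pixels.

```python
import numpy as np

def linear_rrn(subject: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Map a subject band so its mean/std match a reference band."""
    gain = reference.std() / subject.std()
    offset = reference.mean() - gain * subject.mean()
    return gain * subject + offset

# Two synthetic "acquisitions" of the same scene with different radiometry.
rng = np.random.default_rng(1)
scene = rng.gamma(shape=2.0, scale=30.0, size=(256, 256))
reference = scene + rng.normal(0, 2, scene.shape)
subject = 0.7 * scene + 15 + rng.normal(0, 2, scene.shape)

normalized = linear_rrn(subject, reference)
print(round(abs(normalized.mean() - reference.mean()), 3))  # ~0 after normalization
```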
APA, Harvard, Vancouver, ISO, and other styles
26

Harte, Christopher. "Towards automatic extraction of harmony information from music signals." Thesis, Queen Mary, University of London, 2010. http://qmro.qmul.ac.uk/xmlui/handle/123456789/534.

Full text
Abstract:
In this thesis we address the subject of automatic extraction of harmony information from audio recordings. We focus on chord symbol recognition and methods for evaluating algorithms designed to perform that task. We present a novel six-dimensional model for equal tempered pitch space based on concepts from neo-Riemannian music theory. This model is employed as the basis of a harmonic change detection function which we use to improve the performance of a chord recognition algorithm. We develop a machine readable text syntax for chord symbols and present a hand labelled chord transcription collection of 180 Beatles songs annotated using this syntax. This collection has been made publicly available and is already widely used for evaluation purposes in the research community. We also introduce methods for comparing chord symbols which we subsequently use for analysing the statistics of the transcription collection. To ensure that researchers are able to use our transcriptions with confidence, we demonstrate a novel alignment algorithm based on simple audio fingerprints that allows local copies of the Beatles audio files to be accurately aligned to our transcriptions automatically. Evaluation methods for chord symbol recall and segmentation measures are discussed in detail and we use our chord comparison techniques as the basis for a novel dictionary-based chord symbol recall calculation. At the end of the thesis, we evaluate the performance of fifteen chord recognition algorithms (three of our own and twelve entrants to the 2009 MIREX chord detection evaluation) on the Beatles collection. Results are presented for several different evaluation measures using a range of evaluation parameters. The algorithms are compared with each other in terms of performance but we also pay special attention to analysing and discussing the benefits and drawbacks of the different evaluation methods that are used.
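The six-dimensional pitch-space model and the harmonic change detection function mentioned above can be sketched roughly as follows: each chroma frame is projected onto three interval circles (fifths, minor thirds, major thirds) and the detection function is the distance between consecutive projections. The radii used here and the absence of smoothing are assumptions of this sketch and may differ from the thesis.

```python
import numpy as np

def tonal_centroid(chroma: np.ndarray) -> np.ndarray:
    """Project a 12-bin chroma vector onto a 6-D pitch-space vector.

    The three circles (fifths, minor thirds, major thirds) follow the
    published tonal-centroid construction; the radii (1, 1, 0.5) are the
    commonly cited values and are an assumption of this sketch.
    """
    p = np.arange(12)
    basis = np.vstack([
        np.sin(p * 7 * np.pi / 6), np.cos(p * 7 * np.pi / 6),              # fifths
        np.sin(p * 3 * np.pi / 2), np.cos(p * 3 * np.pi / 2),              # minor thirds
        0.5 * np.sin(p * 2 * np.pi / 3), 0.5 * np.cos(p * 2 * np.pi / 3),  # major thirds
    ])
    total = chroma.sum()
    return basis @ chroma / total if total else np.zeros(6)

def harmonic_change(chromagram: np.ndarray) -> np.ndarray:
    """Euclidean distance between consecutive frames' tonal centroids."""
    centroids = np.array([tonal_centroid(frame) for frame in chromagram])
    return np.linalg.norm(np.diff(centroids, axis=0), axis=1)

# Two C-major frames followed by an A-minor frame: the detection function
# stays near zero for the first transition and peaks at the second.
C = np.zeros(12); C[[0, 4, 7]] = 1        # C E G
Am = np.zeros(12); Am[[9, 0, 4]] = 1      # A C E
print(harmonic_change(np.array([C, C, Am])).round(3))
```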
APA, Harvard, Vancouver, ISO, and other styles
27

Mason, Oliver Jan. "The automatic extraction of linguistic information from text corpora." Thesis, University of Birmingham, 2006. http://etheses.bham.ac.uk//id/eprint/116/.

Full text
Abstract:
This is a study exploring the feasibility of a fully automated analysis of linguistic data. It identifies a requirement for large-scale investigations, which cannot be done manually by a human researcher. Instead, methods from natural language processing are suggested as a way to analyse large amounts of corpus data without any human intervention. Human involvement hinders scalability and introduces a bias which prevents studies from being completely replicable. The fundamental assumption underlying this work is that linguistic analysis must be empirical, and that reliance on existing theories or even descriptive categories should be avoided as far as possible. In this thesis we report the results of a number of case studies investigating various areas of language description, lexis, grammar, and meaning. The aim of these case studies is to see how far we can automate the analysis of different aspects of language, both with data gathering and subsequent processing of the data. The outcomes of the feasibility studies demonstrate the practicability of such automated analyses.
APA, Harvard, Vancouver, ISO, and other styles
28

Palmer, David Donald. "Modeling uncertainty for information extraction from speech data /." Thesis, Connect to this title online; UW restricted, 2001. http://hdl.handle.net/1773/5834.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Frunza, Oana Magdalena. "Personalized Medicine through Automatic Extraction of Information from Medical Texts." Thèse, Université d'Ottawa / University of Ottawa, 2012. http://hdl.handle.net/10393/22724.

Full text
Abstract:
The wealth of medical-related information available today gives rise to a multidimensional source of knowledge. Research discoveries published in prestigious venues, electronic-health records data, discharge summaries, clinical notes, etc., all represent important medical information that can assist in the medical decision-making process. The challenge that comes with accessing and using such vast and diverse sources of data lies in the ability to distil and extract reliable and relevant information. Computer-based tools that use natural language processing and machine learning techniques have proven to help address such challenges. The current work proposes reliable automatic solutions for solving tasks that can help achieve personalized medicine, a medical practice that brings together general medical knowledge and case-specific medical information. Phenotypic medical observations, along with data coming from test results, are not enough when assessing and treating a medical case. Genetic, lifestyle, background and environmental data also need to be taken into account in the medical decision process. This thesis’s goal is to prove that natural language processing and machine learning techniques represent reliable solutions for solving important medical-related problems. From the numerous research problems that need to be answered when implementing personalized medicine, the scope of this thesis is restricted to four, as follows: 1. Automatic identification of obesity-related diseases by using only textual clinical data; 2. Automatic identification of relevant abstracts of published research to be used for building systematic reviews; 3. Automatic identification of gene functions based on textual data of published medical abstracts; 4. Automatic identification and classification of important medical relations between medical concepts in clinical and technical data. This thesis’s investigation of automatic solutions for achieving personalized medicine through information identification and extraction focused on individual, specific problems that can later be linked in a puzzle-building manner. A diverse representation technique that follows a divide-and-conquer methodological approach proves to be the most reliable solution for building automatic models that solve the above-mentioned tasks. The methodologies that I propose are supported by in-depth research experiments and thorough discussions and conclusions.
APA, Harvard, Vancouver, ISO, and other styles
30

Chen, Hsinchun, Joanne Martinez, Amy Kirchhoff, Tobun Dorbin Ng, and Bruce R. Schatz. "Alleviating Search Uncertainty through Concept Associations: Automatic Indexing, Co-Occurrence Analysis, and Parallel Computing." Wiley Periodicals, Inc, 1998. http://hdl.handle.net/10150/106252.

Full text
Abstract:
Artificial Intelligence Lab, Department of MIS, University of Arizona
In this article, we report research on an algorithmic approach to alleviating search uncertainty in a large information space. Grounded on object filtering, automatic indexing, and co-occurrence analysis, we performed a large-scale experiment using a parallel supercomputer (SGI Power Challenge) to analyze 400,000 abstracts in an INSPEC computer engineering collection. Two system-generated thesauri, one based on a combined object filtering and automatic indexing method, and the other based on automatic indexing only, were compared with the human-generated INSPEC subject thesaurus. Our user evaluation revealed that the system-generated thesauri were better than the INSPEC thesaurus in concept recall, but in concept precision the 3 thesauri were comparable. Our analysis also revealed that the terms suggested by the 3 thesauri were complementary and could be used to significantly increase "variety" in search terms and thereby reduce search uncertainty.
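The co-occurrence analysis underlying such system-generated thesauri can be illustrated with a small sketch: document frequencies and pairwise co-occurrence counts over a few invented term sets, combined into a Jaccard-style association weight. The actual weighting scheme used in the article is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

# Toy "abstracts" already reduced to index terms (the article first applies
# object filtering and automatic indexing to 400,000 INSPEC abstracts).
docs = [
    {"neural networks", "pattern recognition", "parallel computing"},
    {"neural networks", "image processing"},
    {"parallel computing", "supercomputer", "neural networks"},
    {"image processing", "pattern recognition"},
]

doc_freq = defaultdict(int)
co_freq = defaultdict(int)
for terms in docs:
    for t in terms:
        doc_freq[t] += 1
    for a, b in combinations(sorted(terms), 2):
        co_freq[(a, b)] += 1

def association(a, b):
    """Jaccard-style association between two index terms."""
    pair = tuple(sorted((a, b)))
    together = co_freq.get(pair, 0)
    return together / (doc_freq[a] + doc_freq[b] - together)

print(round(association("neural networks", "parallel computing"), 2))  # stronger link
print(round(association("neural networks", "image processing"), 2))    # weaker link
```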
APA, Harvard, Vancouver, ISO, and other styles
31

Gorinski, Philip John. "Automatic movie analysis and summarisation." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31053.

Full text
Abstract:
Automatic movie analysis is the task of applying Machine Learning methods to the field of screenplays, movie scripts, and motion pictures to facilitate or enable various tasks throughout the entirety of a movie’s life-cycle. From helping with making informed decisions about a new movie script with respect to aspects such as its originality, similarity to other movies, or even commercial viability, all the way to offering consumers new and interesting ways of viewing the final movie, many stages in the life-cycle of a movie stand to benefit from Machine Learning techniques that promise to reduce human effort, time, or both. Within this field of automatic movie analysis, this thesis addresses the task of summarising the content of screenplays, enabling users at any stage to gain a broad understanding of a movie from greatly reduced data. The contributions of this thesis are four-fold: (i) We introduce ScriptBase, a new large-scale data set of original movie scripts, annotated with additional meta-information such as genre and plot tags, cast information, and log- and tag-lines. To our knowledge, ScriptBase is the largest data set of its kind, containing scripts and information for almost 1,000 Hollywood movies. (ii) We present a dynamic summarisation model for the screenplay domain, which allows for extraction of highly informative and important scenes from movie scripts. The extracted summaries allow for the content of the original script to stay largely intact and provide the user with its important parts, while greatly reducing the script-reading time. (iii) We extend our summarisation model to capture additional modalities beyond the screenplay text. The model is rendered multi-modal by introducing visual information obtained from the actual movie and by extracting scenes from the movie, allowing users to generate visual summaries of motion pictures. (iv) We devise a novel end-to-end neural network model for generating natural language screenplay overviews. This model enables the user to generate short descriptive and informative texts that capture certain aspects of a movie script, such as its genres, approximate content, or style, allowing them to gain a fast, high-level understanding of the screenplay. Multiple automatic and human evaluations were carried out to assess the performance of our models, demonstrating that they are well-suited for the tasks set out in this thesis, outperforming strong baselines. Furthermore, the ScriptBase data set has started to gain traction, and is currently used by a number of other researchers in the field to tackle various tasks relating to screenplays and their analysis.
APA, Harvard, Vancouver, ISO, and other styles
32

Aslam, Irfan. "Semantic frame based automatic extraction of typological information from descriptive grammars." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-17893.

Full text
Abstract:
This thesis project addresses the machine learning (ML) modelling aspects of the problem of automatically extracting typological linguistic information about natural languages spoken in South Asia from annotated descriptive grammars. Without delving into the theory and methods of Natural Language Processing (NLP), the focus has been to develop and test a machine learning (ML) model dedicated to the information extraction part. Starting with the existing state-of-the-art frameworks to get labelled training data through the structured representation of the descriptive grammars, the problem has been modelled as a supervised ML classification task where the annotated text is provided as input and the objective is to classify the input to one of the pre-learned labels. The approach has been to systematically explore the data to develop an understanding of the problem domain and then evaluate a set of four potential ML algorithms using predetermined performance metrics, namely accuracy, recall, precision and F-score. It turned out that the problem splits into two independent classification tasks: a binary classification task and a multiclass classification task. The four selected algorithms (Decision Trees, Naïve Bayes, Support Vector Machines, and Logistic Regression), belonging to both linear and non-linear families of ML models, are independently trained and compared for both classification tasks. Using stratified 10-fold cross-validation, performance metrics are measured and the candidate algorithms are compared. Logistic Regression provided the best overall results, with Decision Tree as a close runner-up. Finally, the Logistic Regression model was selected for further fine-tuning and used in a web demo of the typological information extraction tool developed to show the usability of the ML model in the field.
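The model comparison described above can be sketched with scikit-learn: the four algorithm families are evaluated with stratified 10-fold cross-validation on a synthetic stand-in for the grammar-derived features. The dataset, parameter choices and scoring details are assumptions of this sketch, not the thesis's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for feature vectors derived from annotated grammar text.
X, y = make_classification(n_samples=600, n_features=40, n_informative=12,
                           n_classes=3, random_state=0)

models = {
    "Decision Tree":        DecisionTreeClassifier(random_state=0),
    "Naive Bayes":          GaussianNB(),
    "Support Vector Mach.": SVC(kernel="linear"),
    "Logistic Regression":  LogisticRegression(max_iter=1000),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(f"{name:22s} acc={scores['test_accuracy'].mean():.2f} "
          f"f1={scores['test_f1_macro'].mean():.2f}")
```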
APA, Harvard, Vancouver, ISO, and other styles
33

Hohm, Joseph Brandon 1982. "Automatic classification of documents with an in-depth analysis of information extraction and automatic summarization." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/29415.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2004.
Includes bibliographical references (leaves 78-80).
Today, annual information fabrication per capita exceeds two hundred and fifty megabytes. As the amount of data increases, classification and retrieval methods become more necessary to find relevant information. This thesis describes a .Net application (named I-Document) that establishes an automatic classification scheme in a peer-to-peer environment that allows free sharing of academic, business, and personal documents. A Web service architecture for metadata extraction, Information Extraction, Information Retrieval, and text summarization is depicted. Specific details regarding the coding process, competition, business model, and technology employed in the project are also discussed.
by Joseph Brandon Hohm.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
34

Lipani, Aldo. "Query rewriting in information retrieval: automatic context extraction from local user documents to improve query results." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2012. http://amslaurea.unibo.it/4528/.

Full text
Abstract:
The central objective of research in Information Retrieval (IR) is to discover new techniques for retrieving relevant information in order to satisfy an Information Need. The Information Need is satisfied when relevant information can be provided to the user. In IR, relevance is a fundamental concept which has changed over time, from popular to personal: what was considered relevant before was information for the whole population, whereas what is considered relevant now is specific information for each user. Hence, there is a need to connect the behavior of the system to the condition of a particular person and their social context; from this need the interdisciplinary field of Human-Centered Computing was born. For the modern search engine, the information extracted for the individual user is crucial. According to Personalized Search (PS), two different techniques are necessary to personalize a search: contextualization (the interconnected conditions that occur in an activity) and individualization (the characteristics that distinguish an individual). This shift of focus towards the individual's need undermines the rigid linearity of the classical model, which has been overtaken by the "berry picking" model; the latter explains that search terms change thanks to the informational feedback received from the search activity, introducing the concept of the evolution of search terms. The development of Information Foraging theory, which observed the correlations between animal foraging and human information foraging, also contributed to this transformation through attempts to optimize the cost-benefit ratio. This thesis arose from the need to satisfy human individuality when searching for information, and it develops a synergistic collaboration between the frontiers of technological innovation and recent advances in IR. The search method developed exploits what is relevant for the user by radically changing the way in which an Information Need is expressed: it is now expressed through the generation of the query and its own context. The method was conceived to improve the quality of search by rewriting the query based on contexts automatically generated from a local knowledge base. Furthermore, the idea of optimizing each IR system has led to developing it as a middleware of interaction between the user and the IR system. The system therefore has just two possible actions: rewriting the query and reordering the results. Equivalent actions have been described in the PS literature, which generally exploits information derived from the analysis of user behavior, while the proposed approach exploits knowledge provided by the user. The thesis goes further by devising a novel assessment procedure, following the "Cranfield paradigm", to evaluate this type of IR system. The results achieved are interesting considering both the effectiveness obtained and the innovative approach undertaken, together with the several applications inspired by the use of a local knowledge base.
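A minimal sketch of the middleware idea described above, assuming a toy notion of context (frequent terms drawn from local documents); it is not the thesis's implementation, but it shows the two permitted actions, query rewriting and result reordering.

```python
# Illustrative sketch (not the thesis's implementation) of a middleware that
# only rewrites the query and reorders results, using terms drawn from a
# local document collection as the user's context.
from collections import Counter
import re

def context_terms(local_docs, top_k=5):
    # Most frequent longer terms in the user's local documents act as the context.
    counts = Counter(w for doc in local_docs
                     for w in re.findall(r"[a-z]+", doc.lower()) if len(w) > 3)
    return [term for term, _ in counts.most_common(top_k)]

def rewrite_query(query, context):
    return query + " " + " ".join(context)          # action 1: query rewriting

def rerank(results, context):
    # action 2: reorder results by how many context terms they mention
    def overlap(text):
        return sum(term in text.lower() for term in context)
    return sorted(results, key=overlap, reverse=True)

if __name__ == "__main__":
    local = ["Notes about neural information retrieval and ranking models."]
    ctx = context_terms(local)
    print(rewrite_query("evaluation measures", ctx))
    print(rerank(["a page on cooking", "a survey of ranking in retrieval"], ctx))
```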
APA, Harvard, Vancouver, ISO, and other styles
35

Constantin, Alexandru. "Automatic structure and keyphrase analysis of scientific publications." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/automatic-structure-and-keyphrase-analysis-of-scientific-publications(2cfe0b83-5cbb-4305-942c-031945437056).html.

Full text
Abstract:
Purpose. This work addresses an escalating problem within the realm of scientific publishing, that stems from accelerated publication rates of article formats difficult to process automatically. The amount of manual labour required to organise a comprehensive corpus of relevant literature has long been impractical. This has, in effect, reduced research efficiency and delayed scientific advancement. Two complementary approaches meant to alleviate this problem are detailed and improved upon beyond the current state-of-the-art, namely logical structure recovery of articles and keyphrase extraction. Methodology. The first approach targets the issue of flat-format publishing. It performs a structural analysis of the camera-ready PDF article and recognises its fine-grained organisation over logical units. The second approach is the application of a keyphrase extraction algorithm that relies on rhetorical information from the recovered structure to better contour an article’s true points of focus. A recount of the scientific article’s function, content and structure is provided, along with insights into how different logical components such as section headings or the bibliography can be automatically identified and utilised for higher-quality keyphrase extraction. Findings. Structure recovery can be carried out independently of an article’s formatting specifics, by exploiting conventional dependencies between logical components. In addition, access to an article’s logical structure is beneficial across term extraction approaches, reducing input noise and facilitating the emphasis of regions of interest. Value. The first part of this work details a novel method for recovering the rhetorical structure of scientific articles that is competitive with state-of-the-art machine learning techniques, yet requires no layout-specific tuning or prior training. The second part showcases a keyphrase extraction algorithm that outperforms other solutions in an established benchmark, yet does not rely on collection statistics or external knowledge sources in order to be proficient.
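The following hypothetical sketch illustrates how access to logical structure could benefit keyphrase scoring, by weighting phrase occurrences according to the section in which they appear; the weights and section names are assumptions, not the algorithm proposed in the thesis.

```python
# Hypothetical sketch: scoring candidate keyphrases by where they occur in the
# recovered logical structure, so that mentions in the title, abstract or
# section headings count for more than mentions in the body.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "heading": 2.0, "body": 1.0}

def score_keyphrases(candidates, sections):
    """candidates: iterable of phrases; sections: dict section_name -> text."""
    scores = {}
    for phrase in candidates:
        total = 0.0
        for name, text in sections.items():
            weight = SECTION_WEIGHTS.get(name, 1.0)
            total += weight * text.lower().count(phrase.lower())
        scores[phrase] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

example = {
    "title": "Keyphrase extraction from scientific articles",
    "abstract": "We study keyphrase extraction using logical structure.",
    "body": "The corpus contains articles from several venues.",
}
print(score_keyphrases(["keyphrase extraction", "corpus"], example))
```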
APA, Harvard, Vancouver, ISO, and other styles
36

Turroni, Francesco <1983>. "Fingerprint Recognition: Enhancement, Feature Extraction and Automatic Evaluation of Algorithms." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2012. http://amsdottorato.unibo.it/4378/.

Full text
Abstract:
The identification of people by measuring some traits of individual anatomy or physiology has led to a specific research area called biometric recognition. This thesis is focused on improving fingerprint recognition systems considering three important problems: fingerprint enhancement, fingerprint orientation extraction and automatic evaluation of fingerprint algorithms. An effective extraction of salient fingerprint features depends on the quality of the input fingerprint. If the fingerprint is very noisy, we are not able to detect a reliable set of features. A new fingerprint enhancement method, which is both iterative and contextual, is proposed. This approach detects high-quality regions in fingerprints, selectively applies contextual filtering and iteratively expands like wildfire toward low-quality ones. A precise estimation of the orientation field would greatly simplify the estimation of other fingerprint features (singular points, minutiae) and improve the performance of a fingerprint recognition system. The fingerprint orientation extraction is improved in two directions. First, after the introduction of a new taxonomy of fingerprint orientation extraction methods, several variants of baseline methods are implemented and, pointing out the role of pre- and post-processing, we show how to improve the extraction. Second, the introduction of a new hybrid orientation extraction method, which follows an adaptive scheme, significantly improves the orientation extraction in noisy fingerprints. Scientific papers typically propose recognition systems that integrate many modules, and therefore an automatic evaluation of fingerprint algorithms is needed to isolate the contributions that determine an actual progress in the state-of-the-art. The lack of a publicly available framework to compare fingerprint orientation extraction algorithms motivates the introduction of a new benchmark area called FOE (including fingerprints and manually-marked orientation ground-truth), along with fingerprint matching benchmarks in the FVC-onGoing framework. The success of such a framework is discussed by providing relevant statistics: more than 1450 algorithms submitted and two international competitions.
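For orientation, the sketch below implements the textbook gradient-based, block-wise orientation-field estimate that such work typically builds on; it is a generic baseline, not the thesis's iterative enhancement or hybrid extraction method.

```python
# Textbook gradient-based orientation-field estimation for a fingerprint image,
# shown only as a baseline illustration of the problem the thesis improves on.
import numpy as np

def orientation_field(image, block=16):
    """image: 2-D float array; returns per-block ridge orientations in radians."""
    gy, gx = np.gradient(image.astype(float))
    h, w = image.shape
    rows, cols = h // block, w // block
    theta = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            sx = gx[i*block:(i+1)*block, j*block:(j+1)*block]
            sy = gy[i*block:(i+1)*block, j*block:(j+1)*block]
            gxy = np.sum(sx * sy)
            gxx_minus_gyy = np.sum(sx**2 - sy**2)
            # Ridge orientation is orthogonal to the dominant gradient direction.
            theta[i, j] = 0.5 * np.arctan2(2 * gxy, gxx_minus_gyy) + np.pi / 2
    return theta

if __name__ == "__main__":
    demo = np.random.rand(128, 128)           # stand-in for a fingerprint image
    print(orientation_field(demo).shape)      # (8, 8)
```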
APA, Harvard, Vancouver, ISO, and other styles
37

Ou, Shiyan, Christopher S. G. Khoo, and Dion H. Goh. "Automatic multi-document summarization for digital libraries." School of Communication & Information, Nanyang Technological University, 2006. http://hdl.handle.net/10150/106042.

Full text
Abstract:
With the rapid growth of the World Wide Web and online information services, more and more information is available and accessible online. Automatic summarization is an indispensable solution to reduce the information overload problem. Multi-document summarization is useful to provide an overview of a topic and allow users to zoom in for more details on aspects of interest. This paper reports three types of multi-document summaries generated for a set of research abstracts, using different summarization approaches: a sentence-based summary generated by a MEAD summarization system that extracts important sentences using various features, another sentence-based summary generated by extracting research objective sentences, and a variable-based summary focusing on research concepts and relationships. A user evaluation was carried out to compare the three types of summaries. The evaluation results indicated that the majority of users (70%) preferred the variable-based summary, while 55% of the users preferred the research objective summary, and only 25% preferred the MEAD summary.
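A toy sentence-extraction summariser in the spirit of feature-based systems such as MEAD is sketched below; the scoring features (term frequency plus a lead-sentence bonus) are simplifications and not the configuration used in the paper.

```python
# Toy multi-document sentence extractor: sentences are scored by word overlap
# with the whole document set plus a small lead-sentence bonus. Purely
# illustrative, not the MEAD system or the variable-based summariser.
import re
from collections import Counter

def summarise(abstracts, n_sentences=3):
    sentences = [s.strip() for text in abstracts
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(w for s in sentences
                        for w in re.findall(r"[a-z]+", s.lower()))
    def score(item):
        idx, sentence = item
        words = re.findall(r"[a-z]+", sentence.lower())
        centroid = sum(word_freq[w] for w in words) / max(len(words), 1)
        return centroid + (1.0 if idx == 0 else 0.0)   # lead-sentence bonus
    ranked = sorted(enumerate(sentences), key=score, reverse=True)
    chosen = sorted(idx for idx, _ in ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)

print(summarise(["Topic models help retrieval. They are widely used.",
                 "Retrieval benefits from topic models in digital libraries."],
                n_sentences=2))
```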
APA, Harvard, Vancouver, ISO, and other styles
38

Wang, Yadong. "Representing signals using only timing information and feature extraction for automatic speech recognition." View online ; access limited to URI, 2003. http://0-wwwlib.umi.com.helin.uri.edu/dissertations/dlnow/3115640.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Sobania, A. S. "The automatic extraction of 3D information from stereoscopic dual-energy X-ray images." Thesis, Nottingham Trent University, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.271786.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Woodbury, Charla Jean. "Automatic Extraction From and Reasoning About Genealogical Records: A Prototype." BYU ScholarsArchive, 2010. https://scholarsarchive.byu.edu/etd/2335.

Full text
Abstract:
Family history research on the web is increasing in popularity, and many competing genealogical websites host large amounts of data-rich, unstructured, primary genealogical records. It is labor-intensive, however, even after making these records machine-readable, for humans to make these records easily searchable. What we need are computer tools that can automatically produce indices and databases from these genealogical records and can automatically identify individuals and events, determine relationships, and put families together. We propose here a possible solution—specialized ontologies, built specifically for extracting information from primary genealogical records, with expert logic and rules to infer genealogical facts and assemble relationship links between persons with respect to the genealogical events in their lives. The deliverables of this solution are extraction ontologies that can extract from parish or town records, annotated versions of original documents, data files of individuals and events, and rules to infer family relationships from stored data. The solution also provides for the ability to query over the rules and data files and to obtain query-result justification linking back to primary genealogical records. An evaluation of the prototype solution shows that the extraction has excellent recall and precision results and that inferred facts are correct.
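To make the idea of inference rules concrete, here is a toy Python rule that derives sibling links from extracted parent-child facts; the names and the rule itself are illustrative and do not reflect the prototype's extraction ontologies or rule language.

```python
# Toy example of the kind of inference rule the abstract describes: given
# parent-child facts extracted from parish records, infer sibling links.
from itertools import combinations

parent_of = [            # (parent, child) facts extracted from records
    ("Anna Larsen", "Peter Larsen"),
    ("Anna Larsen", "Maria Larsen"),
    ("Anna Larsen", "Johan Larsen"),
]

def infer_siblings(parent_child_facts):
    children_by_parent = {}
    for parent, child in parent_child_facts:
        children_by_parent.setdefault(parent, set()).add(child)
    siblings = set()
    for children in children_by_parent.values():
        for a, b in combinations(sorted(children), 2):
            siblings.add((a, b))          # rule: shared parent -> sibling link
    return siblings

print(infer_siblings(parent_of))
```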
APA, Harvard, Vancouver, ISO, and other styles
41

El-Harby, Ahmed Ahmed Abd El-Fattah. "Automatic extraction of vector representations of line features from remotely sensed images." Thesis, Keele University, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.344096.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Jin, Xiaoying. "Automatic extraction of man-made objects from high-resolution satellite imagery by information fusion." Diss., Columbia, Mo. : University of Missouri-Columbia, 2005. http://hdl.handle.net/10355/5816.

Full text
Abstract:
Thesis (Ph.D.)--University of Missouri-Columbia, 2005.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file, viewed on November 15, 2006. Vita. Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
43

Siau, Nor Zainah. "A teachable semi-automatic web information extraction system based on evolved regular expression patterns." Thesis, Loughborough University, 2014. https://dspace.lboro.ac.uk/2134/14687.

Full text
Abstract:
This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model to enhance the automatic extractor. This uses a human as a teacher to identify and extract relevant information from the semi-structured HTML webpages. Regular expressions, which have been chosen as the pattern matching tool, are automatically generated based on the training data to provide an improved grammar and lexicon. This particularly benefits the GP system which may need to extend its lexicon in the presence of new tokens in the web pages. These tokens allow the GP method to produce new extraction patterns for new requirements.
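One way such a system might score an evolved regular expression against teacher-labelled examples is sketched below; the pattern, page text and F-measure fitness are illustrative assumptions, not the thesis's actual fitness function.

```python
# Illustrative fitness function for an evolved regular expression: a candidate
# pattern is scored by the F-measure of its matches against the spans the human
# teacher marked as relevant. The pattern and page below are made-up examples.
import re

def regex_fitness(pattern, page_text, teacher_spans):
    try:
        found = {m.group(0) for m in re.finditer(pattern, page_text)}
    except re.error:
        return 0.0                      # malformed individuals score zero
    expected = set(teacher_spans)
    if not found or not expected:
        return 0.0
    tp = len(found & expected)
    precision = tp / len(found)
    recall = tp / len(expected)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

page = "Price: 12.50 GBP  Price: 8.99 GBP  Contact: sales@example.com"
print(regex_fitness(r"\d+\.\d{2} GBP", page, ["12.50 GBP", "8.99 GBP"]))  # 1.0
```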
APA, Harvard, Vancouver, ISO, and other styles
44

Quirchmayr, Thomas [Verfasser], and Barbara [Akademischer Betreuer] Paech. "Retrospective Semi-automated Software Feature Extraction from Natural Language User Manuals / Thomas Quirchmayr ; Betreuer: Barbara Paech." Heidelberg : Universitätsbibliothek Heidelberg, 2018. http://d-nb.info/1177149354/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Kucuk, Dilek. "Exploiting Information Extraction Techniques For Automatic Semantic Annotation And Retrieval Of News Videos In Turkish." Phd thesis, METU, 2011. http://etd.lib.metu.edu.tr/upload/12613043/index.pdf.

Full text
Abstract:
Information extraction (IE) is known to be an effective technique for automatic semantic indexing of news texts. In this study, we propose a text-based fully automated system for the semantic annotation and retrieval of news videos in Turkish which exploits several IE techniques on the video texts. The IE techniques employed by the system include named entity recognition, automatic hyperlinking, person entity extraction with coreference resolution, and event extraction. The system utilizes the outputs of the components implementing these IE techniques as the semantic annotations for the underlying news video archives. Apart from the IE components, the proposed system comprises a news video database in addition to components for news story segmentation, sliding text recognition, and semantic video retrieval. We also propose a semi-automatic counterpart of the system where the only manual intervention takes place during text extraction. Both systems are executed on genuine video data sets consisting of videos broadcast by the Turkish Radio and Television Corporation. The current study is significant as it proposes the first fully automated system to facilitate semantic annotation and retrieval of news videos in Turkish, yet the proposed system and its semi-automated counterpart are quite generic and hence they could be customized to build similar systems for video archives in other languages as well. Moreover, IE research on Turkish texts is known to be rare and, within the course of this study, we have proposed and implemented novel techniques for several IE tasks on Turkish texts. As an application example, we have demonstrated the utilization of the implemented IE components to facilitate multilingual video retrieval.
APA, Harvard, Vancouver, ISO, and other styles
46

Wang, Guiwei. "Automatic information extraction and prediction of karst rocky desertification in Puding using remote sensing data." Thesis, Högskolan i Gävle, Samhällsbyggnad, GIS, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-23988.

Full text
Abstract:
Karst rocky desertification (KRD) is a severe environmental problem in southwest China. Revealing the KRD condition is vital to solving the problem. One way to address it is by identifying KRD areas, so that policy-makers and researchers may get a better view of the issue and know where the affected areas are located. The study area is Puding, a county located in the central part of Guizhou province. Based on Landsat data and using GIS and RS techniques, KRD information for Puding was extracted. Furthermore, the study monitored decades of change in the environmental problem in Puding and predicted possible future conditions. Other researchers and decision makers may get a better view of the issue from the study results. In addition to Landsat data, the other data used include ASTER Global Digital Elevation Model data, MODIS data, Google Earth data and other thematic maps. Two methods, an expert classification system and a spectral-features-based model, were applied to extract KRD information and compared with each other; their classification rules were taken separately from previous studies. Necessary preprocessing procedures such as atmospheric correction and geometric correction were performed before extraction, and the extracted results were then evaluated and analyzed. Predictions were made with a cellular automata Markov module. Based on the extracted KRD results, the distribution, percentage, change, and predicted development of KRD conditions in Puding were presented. The accuracy evaluation showed that the spectral-features-based model had acceptable performance, whereas the KRD results extracted with the expert classification system were poor. The extracted KRD results, including the KRD maps and the prediction map, indicated that KRD areas in Puding decreased from spring 1993 to spring 2016 and suggested paying more attention to seasonal changes in KRD areas.
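As a simplified illustration of a spectral-features-based rule, the sketch below maps Landsat red and near-infrared reflectance to KRD severity classes via NDVI thresholds; the thresholds and class scheme are placeholders, not the rules adopted in the study.

```python
# Simplified illustration of a spectral-features-based rule for mapping karst
# rocky desertification severity from Landsat bands. The NDVI thresholds are
# placeholders, not the classification rules used in the study.
import numpy as np

def ndvi(red, nir):
    return (nir - red) / np.clip(nir + red, 1e-6, None)

def krd_severity(red, nir):
    """Return an integer map: 0 = none, 1 = moderate, 2 = severe KRD."""
    v = ndvi(red, nir)
    severity = np.zeros_like(v, dtype=int)
    severity[v < 0.4] = 1   # sparse vegetation -> moderate (assumed cut-off)
    severity[v < 0.2] = 2   # mostly exposed rock -> severe (assumed cut-off)
    return severity

if __name__ == "__main__":
    red = np.array([[0.10, 0.30], [0.25, 0.05]])
    nir = np.array([[0.50, 0.35], [0.30, 0.45]])
    print(krd_severity(red, nir))
```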
APA, Harvard, Vancouver, ISO, and other styles
47

Johansson, Elias. "Separation and Extraction of Valuable Information From Digital Receipts Using Google Cloud Vision OCR." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-88602.

Full text
Abstract:
Automation is a desirable feature in many business areas. Manually extracting information from a physical object such as a receipt is something that can be automated to save resources for a company or a private person. This paper describes the process of combining an existing OCR engine with a purpose-built Python script to extract valuable information from a digital image of a receipt. Values such as VAT, VAT%, date, and total, gross and net cost are considered valuable information. This feature has already been implemented in existing applications; however, the company for which this project was carried out is interested in creating its own version. The project is an experiment to see whether it is possible to implement such an application with restricted resources, and to develop a program that can extract the information mentioned above. The paper guides the reader through the development of the program, as well as the mindset, findings and steps taken to overcome the problems encountered along the way. The program achieved a success rate of 86.6% in extracting the most valuable information (total cost, VAT% and date) from a set of 53 receipts originating from 34 separate establishments.
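A minimal sketch of the post-OCR parsing step is shown below: regular expressions pull a date, total, VAT amount and VAT rate out of the raw text an OCR engine might return. The patterns and the sample receipt are assumptions for illustration, not the program evaluated in the paper.

```python
# Sketch of the post-OCR step: pulling a few fields out of the raw text returned
# by an OCR engine with regular expressions. The patterns below are assumptions
# for illustration, not the rules used in the paper's program.
import re

def parse_receipt(ocr_text):
    fields = {}
    date = re.search(r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b", ocr_text)
    total = re.search(r"(?i)total.*?(\d+[.,]\d{2})", ocr_text)
    vat = re.search(r"(?i)(?:vat|moms).*?(\d+[.,]\d{2})", ocr_text)
    vat_rate = re.search(r"(\d{1,2})\s*%", ocr_text)
    if date: fields["date"] = date.group(1)
    if total: fields["total"] = total.group(1).replace(",", ".")
    if vat: fields["vat"] = vat.group(1).replace(",", ".")
    if vat_rate: fields["vat_percent"] = vat_rate.group(1)
    return fields

sample = "ICA Kvantum 2019-03-14  Total 129,00  Moms 25% 25,80"
print(parse_receipt(sample))
# {'date': '2019-03-14', 'total': '129.00', 'vat': '25.80', 'vat_percent': '25'}
```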
APA, Harvard, Vancouver, ISO, and other styles
48

Moncla, Ludovic. "Automatic Reconstruction of Itineraries from Descriptive Texts." Thesis, Pau, 2015. http://www.theses.fr/2015PAUU3029/document.

Full text
Abstract:
This PhD thesis is part of the research project PERDIDO, which aims at extracting and retrieving displacements from textual documents. This work was conducted in collaboration with the LIUPPA laboratory of the university of Pau (France), the IAAA team of the university of Zaragoza (Spain) and the COGIT laboratory of IGN (France). The objective of this PhD is to propose a method for establishing a processing chain to support the geoparsing and geocoding of text documents describing events strongly linked with space, and in particular an approach for the automatic geocoding of itineraries described in natural language. Our proposal is divided into two main tasks. The first task aims at identifying and extracting information describing the itinerary in texts, such as spatial named entities and expressions of displacement or perception. The second task deals with the reconstruction of the itinerary. Our proposal combines local information extracted using natural language processing and physical features extracted from external geographical sources such as gazetteers or datasets providing digital elevation models. The geoparsing part is a Natural Language Processing approach which combines the use of part-of-speech tagging and syntactico-semantic combined patterns (cascades of transducers) for the annotation of spatial named entities and expressions of displacement or perception. The main contribution in the first task of our approach is toponym disambiguation, which represents an important issue in Geographical Information Retrieval (GIR). We propose an unsupervised geocoding algorithm that takes advantage of clustering techniques to provide a solution for disambiguating the toponyms found in gazetteers, and at the same time estimating the spatial footprint of those other, fine-grained toponyms not found in gazetteers. We propose a generic graph-based model for the automatic reconstruction of itineraries from texts, where each vertex represents a location and each edge represents a path between locations, combining information extracted from texts and information extracted from geographical databases. Our model is original in that, in addition to taking into account the classic elements (paths and waypoints), it allows the representation of the other elements describing an itinerary, such as features seen or mentioned as landmarks. To build this graph-based representation of the itinerary automatically, our approach computes an informed spanning tree on a weighted graph. Each edge of the initial graph is weighted using a multi-criteria analysis approach combining qualitative and quantitative criteria. Criteria are based on information extracted from the text and information extracted from geographical sources. For instance, we compare information given in the text, such as spatial relations describing orientation (e.g., going south), with the geographical coordinates of locations found in gazetteers. Finally, according to the definition of an itinerary and the information used in natural language to describe itineraries, we propose a markup language for encoding spatial and motion information based on the Text Encoding and Interchange guidelines (TEI), which define a standard for the representation of texts in digital form. The rationale of the proposed approach has been verified with a set of experiments on a corpus of multilingual hiking descriptions (French, Spanish and Italian).
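The graph step can be pictured with a small networkx sketch: locations are nodes, candidate paths are weighted edges (weights standing in for the multi-criteria scores), and a minimum spanning tree retains the most plausible connections. Place names and weights are invented.

```python
# Tiny illustration of the graph step described above: locations become nodes,
# candidate paths become weighted edges, and a minimum spanning tree keeps the
# most plausible connections. Place names and weights are invented.
import networkx as nx

G = nx.Graph()
candidate_paths = [           # (from, to, cost from multi-criteria analysis)
    ("Pau", "Lourdes", 1.0),
    ("Lourdes", "Gavarnie", 1.5),
    ("Pau", "Gavarnie", 4.0),      # less plausible direct link
    ("Gavarnie", "Ordesa", 2.0),
]
for a, b, cost in candidate_paths:
    G.add_edge(a, b, weight=cost)

itinerary = nx.minimum_spanning_tree(G, weight="weight")
print(sorted(itinerary.edges(data="weight")))
```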
APA, Harvard, Vancouver, ISO, and other styles
49

Afzal, Naveed. "Unsupervised relation extraction for e-learning applications." Thesis, University of Wolverhampton, 2011. http://hdl.handle.net/2436/299064.

Full text
Abstract:
In this modern era many educational institutes and business organisations are adopting the e-Learning approach as it provides an effective method for educating and testing their students and staff. The continuous development in the area of information technology and increasing use of the internet has resulted in a huge global market and rapid growth for e-Learning. Multiple Choice Tests (MCTs) are a popular form of assessment and are quite frequently used by many e-Learning applications as they are well adapted to assessing factual, conceptual and procedural information. In this thesis, we present an alternative to the lengthy and time-consuming activity of developing MCTs by proposing a Natural Language Processing (NLP) based approach that relies on semantic relations extracted using Information Extraction to automatically generate MCTs. Information Extraction (IE) is an NLP field used to recognise the most important entities present in a text, and the relations between those concepts, regardless of their surface realisations. In IE, text is processed at a semantic level that allows the partial representation of the meaning of a sentence to be produced. IE has two major subtasks: Named Entity Recognition (NER) and Relation Extraction (RE). In this work, we present two unsupervised RE approaches (surface-based and dependency-based). The aim of both approaches is to identify the most important semantic relations in a document without assigning explicit labels to them in order to ensure broad coverage, unrestricted to predefined types of relations. In the surface-based approach, we examined different surface pattern types, each implementing different assumptions about the linguistic expression of semantic relations between named entities while in the dependency-based approach we explored how dependency relations based on dependency trees can be helpful in extracting relations between named entities. Our findings indicate that the presented approaches are capable of achieving high precision rates. Our experiments make use of traditional, manually compiled corpora along with similar corpora automatically collected from the Web. We found that an automatically collected web corpus is still unable to ensure the same level of topic relevance as attained in manually compiled traditional corpora. Comparison between the surface-based and the dependency-based approaches revealed that the dependency-based approach performs better. Our research enabled us to automatically generate questions regarding the important concepts present in a domain by relying on unsupervised relation extraction approaches as extracted semantic relations allow us to identify key information in a sentence. The extracted patterns (semantic relations) are then automatically transformed into questions. In the surface-based approach, questions are automatically generated from sentences matched by the extracted surface-based semantic pattern which relies on a certain set of rules. Conversely, in the dependency-based approach questions are automatically generated by traversing the dependency tree of extracted sentence matched by the dependency-based semantic patterns. The MCQ systems produced from these surface-based and dependency-based semantic patterns were extrinsically evaluated by two domain experts in terms of questions and distractors readability, usefulness of semantic relations, relevance, acceptability of questions and distractors and overall MCQ usability. 
The evaluation results revealed that the MCQ system based on dependency-based semantic relations performed better than the surface-based one. A major outcome of this work is an integrated system for MCQ generation that has been evaluated by potential end users.
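As a toy illustration of the surface-based idea, the sketch below takes the word sequence between two already-recognised named entities as a pattern and turns a matched sentence into a cloze-style question; it is not the thesis's pattern model or question generator.

```python
# Toy illustration of surface-based relation extraction: take the token sequence
# between two already-recognised named entities as the relation pattern, then
# turn a matched sentence into a cloze-style question. Not the thesis's model.
import re

def surface_pattern(sentence, entity1, entity2):
    m = re.search(re.escape(entity1) + r"\s+(.*?)\s+" + re.escape(entity2),
                  sentence)
    return m.group(1) if m else None

def cloze_question(sentence, answer_entity):
    return sentence.replace(answer_entity, "______"), answer_entity

sent = "Penicillin was discovered by Alexander Fleming in 1928."
print(surface_pattern(sent, "Penicillin", "Alexander Fleming"))  # 'was discovered by'
print(cloze_question(sent, "Alexander Fleming"))
```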
APA, Harvard, Vancouver, ISO, and other styles
50

Nyström, Stefan. "Evaluation of a New Method for Extraction of Drift-Stable Information from Electronic Tongue Measurements." Thesis, Linköping University, Department of Electrical Engineering, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-1615.

Full text
Abstract:

This thesis is a part of a project where a new method, the base descriptor approach, is studied. The purpose of this method is to reduce drift and extract vital information from electronic tongue measurements. Reference solutions, called descriptors, are measured and the measurements are used to find base descriptors. A base descriptor is, in this thesis, a regression vector for prediction of the property that the descriptor represents. The property is in this case the concentration of a chemical substance in the descriptor solution. Measurements from test samples, in this case fruit juices, are projected onto the base descriptors to extract vital and drift-stable information from the test samples.

The base descriptors are used to determine the concentrations of the descriptors' chemical substances in the juices and thereby also to classify the different juices. It is assumed that the measurements of samples of juices and descriptors drift the same way. This assumption has to be true in order for the base descriptor approach to work. The base descriptors are calculated by multivariate regression methods like partial least squares regression (PLSR) and principal component regression (PCR).

Only two of the descriptors tested in this thesis worked as a basis for base descriptors. The base descriptors' predictions of the concentrations of chemical substances in the juices are hard to evaluate since the true concentrations are unknown. Comparing the projections of juice measurements onto the base descriptors with a classification model on the juice measurements performed by principal component analysis (PCA), there is no significant difference in drift of the juice measurements in the results of the two methods. The base descriptors, however, separate the juices for classification somewhat better than the classification of juices performed by PCA.
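A sketch of the base-descriptor idea using scikit-learn is given below: a PLS regression is fitted from reference (descriptor) measurements to known concentrations, and new juice measurements are then passed through the fitted regression vector. All shapes and numbers are invented placeholders.

```python
# Sketch of the base-descriptor idea: fit a PLS regression from electronic-tongue
# measurements of reference (descriptor) solutions to the known concentration,
# then apply the fitted regression to new juice measurements. Shapes and numbers
# here are invented placeholders, not the thesis's data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_ref, n_channels = 40, 25                 # reference measurements x sensor channels
X_ref = rng.normal(size=(n_ref, n_channels))
true_vector = rng.normal(size=n_channels)
y_conc = X_ref @ true_vector + 0.1 * rng.normal(size=n_ref)   # known concentrations

pls = PLSRegression(n_components=3)
pls.fit(X_ref, y_conc)                     # the "base descriptor" is pls.coef_

X_juice = rng.normal(size=(5, n_channels))   # stand-in for juice measurements
predicted = pls.predict(X_juice).ravel()     # projection onto the base descriptor
print(predicted)
```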

APA, Harvard, Vancouver, ISO, and other styles