Dissertations / Theses on the topic 'LM. Automatic text retrieval'

To see the other types of publications on this topic, follow the link: LM. Automatic text retrieval.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 38 dissertations / theses for your research on the topic 'LM. Automatic text retrieval.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Viana, Hugo Henrique Amorim. "Automatic information retrieval through text-mining." Master's thesis, Faculdade de Ciências e Tecnologia, 2013. http://hdl.handle.net/10362/11308.

Full text
Abstract:
The dissertation presented for obtaining the Master’s Degree in Electrical Engineering and Computer Science, at Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia
Nowadays, a huge number of firms in the European Union are catalogued as Small and Medium Enterprises (SMEs), and they employ a great portion of the active workforce in Europe. Nonetheless, SMEs cannot afford to implement methods or tools to systematically adopt innovation as part of their business process. Innovation is the engine of competitiveness in the globalized environment, especially in the current socio-economic situation. This thesis provides a platform that, when integrated with the ExtremeFactories (EF) project, helps SMEs become more competitive by means of scheduled monitoring functionality. A text-mining platform that is able to schedule the gathering of information through keywords is presented. Among the several implementation choices made in developing the platform, one deserves particular emphasis: the framework, Apache Lucene Core, which supplies an efficient text-mining engine and is used extensively for the purposes of the thesis.
APA, Harvard, Vancouver, ISO, and other styles
2

Lee, Hyo Sook. "Automatic text processing for Korean language free text retrieval." Thesis, University of Sheffield, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.322916.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Kay, Roderick Neil. "Text analysis, summarising and retrieval." Thesis, University of Salford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360435.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Goyal, Pawan. "Analytic knowledge discovery techniques for ad-hoc information retrieval and automatic text summarization." Thesis, Ulster University, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.543897.

Full text
Abstract:
Information retrieval is broadly concerned with the problem of automated searching for information within some document repository to support various information requests by users. The traditional retrieval frameworks work on the simplistic assumptions of “word independence” and “bag-of-words”, giving rise to problems such as “term mismatch” and “context independent document indexing”. Automatic text summarization systems, which use the same paradigm as that of information retrieval, also suffer from these problems. The concept of “semantic relevance” has also not been formulated in the existing literature. This thesis presents a detailed investigation of the knowledge discovery models and proposes new approaches to address these issues. The traditional retrieval frameworks do not succeed in defining the document content fully because they do not process the concepts in the documents; only the words are processed. To address this issue, a document retrieval model has been proposed using concept hierarchies, learnt automatically from a corpus. A novel approach to give a meaningful representation to the concept nodes in a learnt hierarchy has been proposed using a fuzzy logic based soft least upper bound method. A novel approach of adapting the vector space model with dependency parse relations for information retrieval has also been developed. A user query for information retrieval (IR) applications may not contain the most appropriate terms (words) as actually intended by the user. This is usually referred to as the term mismatch problem and is a crucial research issue in IR. To address this issue, a theoretical framework for Query Representation (QR) has been developed through a comprehensive theoretical analysis of a parametric query vector. A lexical association function has been derived analytically using the relevance criteria. The proposed QR model expands the user query using this association function. A novel term association metric has been derived using the Bernoulli model of randomness. The derived metric has been used to develop a Bernoulli Query Expansion (BQE) model. The Bernoulli model of randomness has also been extended to the pseudo relevance feedback problem by proposing a Bernoulli Pseudo Relevance (BPR) model. In the traditional retrieval frameworks, the context in which a term occurs is mostly overlooked in assigning its indexing weight. This results in context independent document indexing. To address this issue, a novel Neighborhood Based Document Smoothing (NBDS) model has been proposed, which uses the lexical association between terms to provide a context sensitive indexing weight to the document terms, i.e. the term weights are redistributed based on the lexical association with the context words. To address the “context independent document indexing” for the sentence extraction based text summarization task, a lexical association measure derived using the Bernoulli model of randomness has been used. A new approach using the lexical association between terms has been proposed to give a context sensitive weight to the document terms and these weights have been used for the sentence extraction task. Developed analytically, the proposed QR, BQE, BPR and NBDS models provide a proper mathematical framework for query expansion and document smoothing techniques, which have largely been heuristic in the existing literature.
Being developed in the generalized retrieval framework, also proposed in this thesis, these models are applicable to all of the retrieval frameworks. These models have been empirically evaluated over the benchmark TREC datasets and have been shown to provide significantly better performance than the baseline retrieval frameworks, without adding significant computational or storage burden. The Bernoulli model applied to the sentence extraction task has also been shown to enhance the performance of the baseline text summarization systems over the benchmark DUC datasets. The theoretical foundations along with the empirical results verify that the proposed knowledge discovery models in this thesis advance the state of the art in the field of information retrieval and automatic text summarization.
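As an illustration of the kind of association-based query expansion summarised above, the sketch below (Python) expands a query with the terms most strongly co-occurring with it in a toy corpus. Plain co-occurrence counts stand in for the association function; the thesis derives its metric analytically from the Bernoulli model of randomness, which is not reproduced here, and the corpus and function names are hypothetical.

    from collections import Counter, defaultdict

    # Toy corpus: each document is a list of terms (hypothetical data).
    corpus = [
        ["information", "retrieval", "query", "expansion"],
        ["query", "terms", "association", "expansion"],
        ["document", "retrieval", "ranking"],
    ]

    # Count how often each pair of distinct terms co-occurs in a document.
    cooc = defaultdict(Counter)
    for doc in corpus:
        for t in set(doc):
            for u in set(doc):
                if t != u:
                    cooc[t][u] += 1

    def expand(query_terms, k=2):
        # Association here is raw co-occurrence frequency; the thesis instead
        # derives an analytical association function from a relevance criterion.
        scores = Counter()
        for t in query_terms:
            scores.update(cooc.get(t, Counter()))
        for t in query_terms:              # do not re-add original query terms
            scores.pop(t, None)
        return list(query_terms) + [t for t, _ in scores.most_common(k)]

    print(expand(["query", "retrieval"]))  # e.g. ['query', 'retrieval', 'expansion', ...]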
APA, Harvard, Vancouver, ISO, and other styles
5

McMurtry, William F. "Information Retrieval for Call Center Quality Assurance." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587036885211228.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Brucato, Matteo. "Temporal Information Retrieval." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/5690/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Ermakova, Liana. "Short text contextualization in information retrieval : application to tweet contextualization and automatic query expansion." Thesis, Toulouse 2, 2016. http://www.theses.fr/2016TOU20023/document.

Full text
Abstract:
La communication efficace a tendance à suivre la loi du moindre effort. Selon ce principe, en utilisant une langue donnée les interlocuteurs ne veulent pas travailler plus que nécessaire pour être compris. Ce fait mène à la compression extrême de textes surtout dans la communication électronique, comme dans les microblogues, SMS, ou les requêtes dans les moteurs de recherche. Cependant souvent ces textes ne sont pas auto-suffisants car pour les comprendre, il est nécessaire d’avoir des connaissances sur la terminologie, les entités nommées ou les faits liés. Ainsi, la tâche principale de la recherche présentée dans ce mémoire de thèse de doctorat est de fournir le contexte d’un texte court à l’utilisateur ou au système comme à un moteur de recherche par exemple.Le premier objectif de notre travail est d'aider l’utilisateur à mieux comprendre un message court par l’extraction du contexte d’une source externe comme le Web ou la Wikipédia au moyen de résumés construits automatiquement. Pour cela nous proposons une approche pour le résumé automatique de documents multiples et nous l’appliquons à la contextualisation de messages, notamment à la contextualisation de tweets. La méthode que nous proposons est basée sur la reconnaissance des entités nommées, la pondération des parties du discours et la mesure de la qualité des phrases. Contrairement aux travaux précédents, nous introduisons un algorithme de lissage en fonction du contexte local. Notre approche s’appuie sur la structure thème-rhème des textes. De plus, nous avons développé un algorithme basé sur les graphes pour le ré-ordonnancement des phrases. La méthode a été évaluée à la tâche INEX/CLEF Tweet Contextualization sur une période de 4 ans. La méthode a été également adaptée pour la génération de snippets. Les résultats des évaluations attestent une bonne performance de notre approche
Efficient communication tends to follow the principle of least effort. According to this principle, when using a given language, interlocutors do not want to work any harder than necessary to reach understanding. This fact leads to the extreme compression of texts, especially in electronic communication, e.g. microblogs, SMS, and search queries. However, sometimes these texts are not self-contained and need to be explained, since understanding them requires knowledge of terminology, named entities or related facts. The main goal of this research is to provide a context to a user or a system from a textual resource. The first aim of this work is to help a user better understand a short message by extracting a context from an external source such as a text collection, the Web or Wikipedia by means of text summarization. To this end we developed an approach for automatic multi-document summarization and applied it to short message contextualization, in particular to tweet contextualization. The proposed method is based on named entity recognition, part-of-speech weighting and sentence quality measuring. In contrast to previous research, we introduced an algorithm for smoothing from the local context. Our approach exploits the topic-comment structure of a text. Moreover, we developed a graph-based algorithm for sentence reordering. The method has been evaluated at the INEX/CLEF Tweet Contextualization track, and we provide the evaluation results over the 4 years of the track. The method was also adapted to snippet retrieval. The evaluation results indicate good performance of the approach.
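A minimal sketch in the spirit of the sentence scoring described above: candidate sentences from an external source are ranked against the tweet, with a crude bonus for capitalised tokens standing in for named entities. The real method uses proper named entity recognition, part-of-speech weighting, smoothing and graph-based reordering; the weights, data and function below are simplified, hypothetical stand-ins.

    def score_sentence(sentence, tweet, entity_weight=2.0):
        # Overlapping words count 1; overlapping capitalised words (a rough
        # proxy for named entities) count entity_weight instead.
        tweet_words = set(tweet.split())
        score = 0.0
        for w in sentence.split():
            if w in tweet_words:
                score += entity_weight if w[:1].isupper() else 1.0
        return score / (len(sentence.split()) or 1)   # normalise by sentence length

    tweet = "Protests in Paris over pension reform"
    candidates = [
        "Paris saw large protests against the pension reform on Tuesday.",
        "The weather in the capital was mild.",
    ]
    context = sorted(candidates, key=lambda s: score_sentence(s, tweet), reverse=True)
    print(context[0])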
APA, Harvard, Vancouver, ISO, and other styles
8

Sequeira, José Francisco Rodrigues. "Automatic knowledge base construction from unstructured text." Master's thesis, Universidade de Aveiro, 2016. http://hdl.handle.net/10773/17910.

Full text
Abstract:
Master's in Computer and Telematics Engineering
Taking into account the overwhelming number of biomedical publications being produced, the effort required for a user to efficiently explore those publications in order to establish relationships between a wide range of concepts is staggering. This dissertation presents GRACE, a web-based platform that provides an advanced graphical exploration interface allowing users to traverse the biomedical domain in order to find explicit and latent associations between annotated biomedical concepts belonging to a variety of semantic types (e.g., Genes, Proteins, Disorders, Procedures and Anatomy). The knowledge base utilized is a collection of MEDLINE articles with English abstracts, annotated with biomedical concepts. These annotations are then stored in an efficient data store that allows for complex queries and high-performance data delivery. Concept relationships are inferred through statistical analysis, applying association measures to annotated terms. These processes grant the graphical interface the ability to create, in real time, a data visualization in the form of a graph for the exploration of these biomedical concept relationships.
Tendo em conta o crescimento do número de publicações biomédicas a serem produzidas todos os anos, o esforço exigido para que um utilizador consiga, de uma forma eficiente, explorar estas publicações para conseguir estabelecer associações entre um conjunto alargado de conceitos torna esta tarefa exaustiva. Nesta disertação apresentamos uma plataforma web chamada GRACE, que providencia uma interface gráfica de exploração que permite aos utilizadores navegar pelo domínio biomédico em busca de associações explícitas ou latentes entre conceitos biomédicos pertencentes a uma variedade de domínios semânticos (i.e., Genes, Proteínas, Doenças, Procedimentos e Anatomia). A base de conhecimento usada é uma coleção de artigos MEDLINE com resumos escritos na língua inglesa. Estas anotações são armazenadas numa base de dados que permite pesquisas complexas e obtenção de dados com alta performance. As relações entre conceitos são inferidas a partir de análise estatística, aplicando medidas de associações entre os conceitos anotados. Estes processos permitem à interface gráfica criar, em tempo real, uma visualização de dados, na forma de um grafo, para a exploração destas relações entre conceitos do domínio biomédico.
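The GRACE abstract above mentions inferring concept relationships by applying association measures to annotated terms. The sketch below computes one common such measure, pointwise mutual information (PMI), over co-annotation counts in a toy set of abstracts; the actual measures, annotations and storage layer used by GRACE are not specified here, so this is purely illustrative.

    import math
    from collections import Counter
    from itertools import combinations

    # Hypothetical per-abstract concept annotations (e.g. from MEDLINE abstracts).
    annotated = [
        {"BRCA1", "breast cancer"},
        {"BRCA1", "breast cancer", "mastectomy"},
        {"insulin", "diabetes"},
        {"diabetes", "metformin"},
    ]

    n_docs = len(annotated)
    single = Counter(c for doc in annotated for c in doc)
    pair = Counter(frozenset(p) for doc in annotated for p in combinations(sorted(doc), 2))

    def pmi(a, b):
        # Pointwise mutual information between two annotated concepts.
        p_ab = pair[frozenset((a, b))] / n_docs
        if p_ab == 0:
            return float("-inf")
        return math.log2(p_ab / ((single[a] / n_docs) * (single[b] / n_docs)))

    print(round(pmi("BRCA1", "breast cancer"), 3))   # strongly associated
    print(pmi("BRCA1", "diabetes"))                  # never co-annotated -> -inf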
APA, Harvard, Vancouver, ISO, and other styles
9

Lipani, Aldo. "Query rewriting in information retrieval: automatic context extraction from local user documents to improve query results." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2012. http://amslaurea.unibo.it/4528/.

Full text
Abstract:
The central objective of research in Information Retrieval (IR) is to discover new techniques to retrieve relevant information in order to satisfy an Information Need. The Information Need is satisfied when relevant information can be provided to the user. In IR, relevance is a fundamental concept which has changed over time, from popular to personal, i.e., what was considered relevant before was information for the whole population, but what is considered relevant now is specific information for each user. Hence, there is a need to connect the behavior of the system to the condition of a particular person and his social context; thereby an interdisciplinary sector called Human-Centered Computing was born. For the modern search engine, the information extracted for the individual user is crucial. According to Personalized Search (PS), two different techniques are necessary to personalize a search: contextualization (interconnected conditions that occur in an activity) and individualization (characteristics that distinguish an individual). This shift of focus to the individual's need undermines the rigid linearity of the classical model, which has been overtaken by the "berry picking" model; the latter explains that search terms change thanks to the informational feedback received from the search activity, introducing the concept of the evolution of search terms. The development of Information Foraging theory, which observed the correlations between animal foraging and human information foraging, also contributed to this transformation through attempts to optimize the cost-benefit ratio. This thesis arose from the need to satisfy human individuality when searching for information, and it develops a synergistic collaboration between the frontiers of technological innovation and recent advances in IR. The search method developed exploits what is relevant for the user by radically changing the way in which an Information Need is expressed, because it is now expressed through the generation of the query and its own context. In fact, the method was born with the aim of improving the quality of search by rewriting the query based on contexts automatically generated from a local knowledge base. Furthermore, the idea of optimizing each IR system led to developing it as a middleware of interaction between the user and the IR system. Thereby the system has just two possible actions: rewriting the query and reordering the results. Similar actions have been described in PS, which generally exploits information derived from the analysis of user behavior, while the proposed approach exploits knowledge provided by the user. The thesis goes further, generating a novel method for an assessment procedure, according to the "Cranfield paradigm", in order to evaluate this type of IR system. The results achieved are interesting considering both the effectiveness achieved and the innovative approach undertaken, together with the several applications inspired by the use of a local knowledge base.
APA, Harvard, Vancouver, ISO, and other styles
10

Martinez-Alvarez, Miguel. "Knowledge-enhanced text classification : descriptive modelling and new approaches." Thesis, Queen Mary, University of London, 2014. http://qmro.qmul.ac.uk/xmlui/handle/123456789/27205.

Full text
Abstract:
The knowledge available to be exploited by text classification and information retrieval systems has significantly changed, both in nature and quantity, in recent years. Nowadays, there are several sources of information that can potentially improve the classification process, and systems should be able to adapt to incorporate multiple sources of available data in different formats. This fact is especially important in environments where the required information changes rapidly, and its utility may be contingent on timely implementation. For these reasons, the importance of adaptability and flexibility in information systems is rapidly growing. Current systems are usually developed for specific scenarios. As a result, significant engineering effort is needed to adapt them when new knowledge appears or there are changes in the information needs. This research investigates the usage of knowledge within text classification from two different perspectives. On the one hand, the application of descriptive approaches for the seamless modelling of text classification, focusing on knowledge integration and complex data representation. The main goal is to achieve a scalable and efficient approach for rapid prototyping for Text Classification that can incorporate different sources and types of knowledge, and to minimise the gap between the mathematical definition and the modelling of a solution. On the other hand, the improvement of different steps of the classification process where knowledge exploitation has traditionally not been applied. In particular, this thesis introduces two classification sub-tasks, namely Semi-Automatic Text Classification (SATC) and Document Performance Prediction (DPP), and several methods to address them. SATC focuses on selecting the documents that are more likely to be wrongly assigned by the system to be manually classified, while automatically labelling the rest. Document performance prediction estimates the classification quality that will be achieved for a document, given a classifier. In addition, we also propose a family of evaluation metrics to measure degrees of misclassification, and an adaptive variation of k-NN.
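Semi-Automatic Text Classification, as defined above, routes to manual review the documents the classifier is most likely to get wrong and labels the rest automatically. A minimal sketch of that selection rule with scikit-learn follows; the confidence threshold, model and data are illustrative assumptions, not the methods actually proposed in the thesis.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = ["refund my order", "parcel arrived broken", "great service, thanks"]
    train_labels = ["billing", "shipping", "praise"]

    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train_texts), train_labels)

    def triage(texts, threshold=0.7):
        # Auto-label confident documents; send the rest to manual classification.
        probs = clf.predict_proba(vec.transform(texts))
        auto, manual = [], []
        for text, p in zip(texts, probs):
            if p.max() >= threshold:
                auto.append((text, clf.classes_[p.argmax()]))
            else:
                manual.append(text)
        return auto, manual

    auto, manual = triage(["please refund my order", "something unrelated entirely"])
    print(auto, manual)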
APA, Harvard, Vancouver, ISO, and other styles
11

Conteduca, Antonio. "L’uso di tecniche di similarità nell’editing di documenti fortemente strutturati." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/19653/.

Full text
Abstract:
The aim of this project is to verify the thesis that the use of similarity-based methods can improve the editing of structured documents through search, retrieval and comparison activities that provide textual support at various levels of detail. The search for similar objects is a very important field of Text Mining and, to this end, two techniques were analysed and implemented that allow fragments such as sentences, paragraphs and sections to be retrieved starting from an input text and a fixed similarity metric. The goal is to design a textual support system that provides suggestions to the user while editing a new document, so as to facilitate writing and make the document conform to the guidelines of an initial document collection. The dissertation covers topics such as the concept of similarity and ways of quantifying it; it discusses how to represent text as vectors and, on this basis, analyses in detail the MRCTA and MinHash algorithms, the latter an instance of the LSH (Locality Sensitive Hashing) scheme that allows the Jaccard similarity to be estimated; it describes SHE, the environment implemented to support the comparison and testing of the techniques considered; finally, it analyses the results of the test phase, whose purpose is to quantify user satisfaction with the system in order to identify, with the help of the SUS questionnaire, the most suitable technique.
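Since the abstract refers to MinHash as an instance of LSH for estimating the Jaccard similarity, a small self-contained sketch of that estimation is given below; the salted-MD5 hash family and the shingle size are illustrative choices, not necessarily those used in the thesis.

    import hashlib

    def shingles(text, k=3):
        # Set of k-word shingles of a text (k = 3 is an arbitrary choice here).
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def minhash_signature(shingle_set, num_hashes=128):
        # For each of num_hashes salted hash functions, keep the minimum
        # hash value observed over the shingle set.
        return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                    for s in shingle_set)
                for seed in range(num_hashes)]

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of matching signature positions estimates the Jaccard similarity.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox leaps over the lazy dog")
    exact = len(a & b) / len(a | b)
    print(exact, estimated_jaccard(minhash_signature(a), minhash_signature(b)))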
APA, Harvard, Vancouver, ISO, and other styles
12

Salvatori, Stefano. "Text-to-Image Information Retrieval Basato sul Transformer Lineare Performer: Sviluppo e Applicazioni per l'Industria della Moda." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
This work falls within the field of Neural Ranking Models, models that are gradually surpassing the state of the art achieved by classical Information Retrieval systems by exploiting the most recent developments in deep neural networks. One of the most widely used architectures in this context is the Transformer, which has proved extremely versatile and effective in a variety of application domains. One of the problems with this model, however, is its quadratic space and time complexity with respect to the input size, which prevents the use of an optimal batch size and of sufficiently long input sequences. The aim of this work is to study the improvements that can be obtained in an Information Retrieval system based on Neural Ranking Models by applying the efficient Performer transformer. The fashion domain was chosen as a case study, for which several artificial intelligence solutions have been proposed in the literature, for retrieval tasks and beyond. Gao, Dehong, et al., in particular, obtained state-of-the-art results by developing FashionBERT, a BERT-based neural ranking model applied to Text-Image Matching (deciding whether a description and an image refer to the same product) and Retrieval (given a textual query, finding the image of the garment it describes). This work first shows how FashionBERT's results can be improved both in effectiveness and efficiency by replacing the quadratic attention layer with the linear version proposed in Performer. Finally, further experiments are conducted by applying the developed model to a Metric Learning task, showing that in this way it is possible to surpass the state of the art obtained in the original FashionBERT paper.
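To make the quadratic-versus-linear attention point concrete, the NumPy sketch below contrasts standard softmax attention, which materialises an n x n matrix, with a kernel-feature-map linear attention of the family that Performer belongs to. The elu+1 feature map used here is a deliberate simplification (in the style of Katharopoulos et al.), not Performer's FAVOR+ random-feature mechanism, and the toy sizes are arbitrary.

    import numpy as np

    n, d = 6, 4                        # sequence length, head dimension (toy sizes)
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

    # Standard softmax attention: builds an n x n score matrix (quadratic in n).
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    softmax_out = (A / A.sum(axis=1, keepdims=True)) @ V

    # Linear attention: apply a positive feature map phi and reassociate the
    # product so that only d x d quantities are ever formed (linear in n).
    def phi(x):                        # elu(x) + 1, a simple positive feature map
        return np.where(x > 0, x + 1.0, np.exp(x))

    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                      # (d x d) summary of keys and values
    normaliser = Qp @ Kp.sum(axis=0)   # per-position normalising terms
    linear_out = (Qp @ kv) / normaliser[:, None]

    print(softmax_out.shape, linear_out.shape)   # both (n, d); the two outputs differ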
APA, Harvard, Vancouver, ISO, and other styles
13

Bonzi, Francesco. "Lyrics Instrumentalness: An Automatic System for Vocal and Instrumental Recognition in Polyphonic Music with Deep Learning." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
Human voice recognition is a crucial task in music information retrieval. In this master thesis we developed an innovative AI system, called Instrumentalness, to address instrumental song tagging. An extended pipeline was proposed to fit the Instrumentalness requirements, with respect to the well-known tasks of Singing Voice Detection and Singing Voice Segmentation. An in-depth study of the available datasets was made and two different approaches were tried. The first one involves strongly labeled datasets and tested different neural architectures, while the second one used an attention mechanism to address a weakly labeled dataset, experimenting with different loss functions. Transfer learning was used to take advantage of the most recent architectures in the music information retrieval field, keeping the model efficient and effective. This work demonstrates that the quality of data is as important as its quantity. Moreover, the architectures addressing strongly labeled datasets achieved the best performance, but it is remarkable that the attention mechanism used to address the weakly labeled dataset seems to be effective, even though the dataset was imbalanced and small.
APA, Harvard, Vancouver, ISO, and other styles
14

Dutra, Marcio Branquinho. "Busca guiada de patentes de Bioinformática." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/95/95131/tde-07022014-150130/.

Full text
Abstract:
As patentes são licenças públicas temporárias outorgadas pelo Estado e que garantem aos inventores e concessionários a exploração econômica de suas invenções. Escritórios de marcas e patentes recomendam aos interessados na concessão que, antes do pedido formal de uma patente, efetuem buscas em diversas bases de dados utilizando sistemas clássicos de busca de patentes e outras ferramentas de busca específicas, com o objetivo de certificar que a criação a ser depositada ainda não foi publicada, seja na sua área de origem ou em outras áreas. Pesquisas demonstram que a utilização de informações de classificação nas buscas por patentes melhoram a eficiência dos resultados das consultas. A pesquisa associada ao trabalho aqui reportado tem como objetivo explorar artefatos linguísticos, técnicas de Recuperação de Informação e técnicas de Classificação Textual para guiar a busca por patentes de Bioinformática. O resultado dessa investigação é o Sistema de Busca Guiada de Patentes de Bioinformática (BPS), o qual utiliza um classificador automático para guiar as buscas por patentes de Bioinformática. A utilização do BPS é demonstrada em comparações com ferramentas de busca de patentes atuais para uma coleção específica de patentes de Bioinformática. No futuro, deve-se experimentar o BPS em coleções diferentes e mais robustas.
Patents are temporary public licenses granted by the State to ensure economic exploitation rights to inventors and assignees. Trademark and patent offices recommend performing wide searches in different databases, using classic patent search systems and specific tools, before a patent application. The goal of these searches is to ensure the invention has not been published yet, either in its original field or in other fields. Research has shown that the use of classification information improves the efficiency of patent searches. The objective of the research related to this work is to explore linguistic artifacts, Information Retrieval techniques and Automatic Classification techniques to guide searches for Bioinformatics patents. The result of this work is the Bioinformatics Patent Search System (BPS), which uses automatic classification to guide searches for Bioinformatics patents. The utility of BPS is illustrated by a comparison with other patent search tools. In the future, BPS should be tested on different and more robust collections.
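A toy sketch of the guided-search idea described above, in which an automatic classifier first filters candidate patents by predicted class and only then ranks them against the query. The pipeline, model and documents below are illustrative assumptions, not the actual BPS implementation.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical training patents labelled as Bioinformatics or not.
    docs = ["sequence alignment algorithm for genomes",
            "gene expression microarray analysis software",
            "combustion engine valve timing mechanism",
            "brake pad material composition"]
    labels = ["bioinformatics", "bioinformatics", "other", "other"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    clf = MultinomialNB().fit(X, labels)

    def guided_search(query, collection):
        # Keep only documents classified as Bioinformatics, then rank by cosine similarity.
        Xc = vec.transform(collection)
        keep = [i for i, c in enumerate(clf.predict(Xc)) if c == "bioinformatics"]
        sims = cosine_similarity(vec.transform([query]), Xc[keep]).ravel()
        return [collection[keep[i]] for i in sims.argsort()[::-1]]

    print(guided_search("protein sequence analysis", docs))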
APA, Harvard, Vancouver, ISO, and other styles
15

Artchounin, Daniel. "Tuning of machine learning algorithms for automatic bug assignment." Thesis, Linköpings universitet, Programvara och system, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139230.

Full text
Abstract:
In software development projects, bug triage consists mainly of assigning bug reports to software developers or teams (depending on the project). The partial or total automation of this task would have a positive economic impact on many software projects. This thesis introduces a systematic four-step method to find some of the best configurations of several machine learning algorithms intended to solve the automatic bug assignment problem. These four steps are respectively used to select a combination of pre-processing techniques, a bug report representation, a potential feature selection technique and to tune several classifiers. The aforementioned method has been applied to three software projects: 66 066 bug reports of a proprietary project, 24 450 bug reports of Eclipse JDT and 30 358 bug reports of Mozilla Firefox. 619 configurations have been applied and compared on each of these three projects. In production, using the approach introduced in this work on the bug reports of the proprietary project would have increased the accuracy by up to 16.64 percentage points.
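The four-step method summarised above (pre-processing, bug report representation, feature selection, classifier tuning) maps naturally onto a cross-validated scikit-learn pipeline. The sketch below shows that shape on toy data; the actual pre-processing choices, parameter grids and classifiers compared in the thesis are not reproduced here.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import GridSearchCV

    bug_reports = ["NullPointerException in login form",
                   "UI button misaligned on settings page",
                   "Crash when saving user profile",
                   "Wrong colour on toolbar icon"]
    assignees = ["backend", "frontend", "backend", "frontend"]

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True)),   # representation (and basic pre-processing)
        ("select", SelectKBest(chi2)),                # feature selection
        ("clf", LinearSVC()),                         # classifier to tune
    ])

    grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "select__k": [5, 10],
        "clf__C": [0.1, 1.0, 10.0],
    }

    search = GridSearchCV(pipeline, grid, cv=2)   # tiny cv only because the toy data is tiny
    search.fit(bug_reports, assignees)
    print(search.best_params_)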
APA, Harvard, Vancouver, ISO, and other styles
16

Piscaglia, Nicola. "Deep Learning for Natural Language Processing: Novel State-of-the-art Solutions in Summarisation of Legal Case Reports." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20342/.

Full text
Abstract:
Deep neural networks are one of the major classification machines in machine learning. Several Deep Neural Networks (DNNs) have been developed and evaluated in recent years in recognition tasks, such as estimating a user base and estimating their interactivity. We present the best algorithms for extracting or summarising text using a deep neural network while allowing the workers to interpret the texts from the output speech. In this work both extractive and abstractive summarisation approaches have been applied. In particular, BERT (Base, Multilingual Cased) and a deep neural network composed by CNN and GRU layers have been used in the extraction-based summarisation while the abstraction-based one has been performed by applying the GPT-2 Transformer model. We show our models achieve high scores in syntactical terms while a human evaluation is still needed to judge the coherence, consistency and unreferenced harmonicity of speech. Our proposed work outperform the state of the art results for extractive summarisation on the Australian Legal Case Report Dataset. Our paper can be viewed as further demonstrating that our model can outperform the state of the art on a variety of extractive and abstractive summarisation tasks. Note: The abstract above was not written by the author; it was generated by providing a part of the thesis introduction as input text to the pre-trained GPT-2 (Small) Transformer model used in this work, which had previously been fine-tuned for 4 epochs with the “NIPS 2015 Papers” dataset.
APA, Harvard, Vancouver, ISO, and other styles
17

Dang, Quoc Bao. "Information spotting in huge repositories of scanned document images." Thesis, La Rochelle, 2018. http://www.theses.fr/2018LAROS024/document.

Full text
Abstract:
Ce travail vise à développer un cadre générique qui est capable de produire des applications de localisation d'informations à partir d’une caméra (webcam, smartphone) dans des très grands dépôts d'images de documents numérisés et hétérogènes via des descripteurs locaux. Ainsi, dans cette thèse, nous proposons d'abord un ensemble de descripteurs qui puissent être appliqués sur des contenus aux caractéristiques génériques (composés de textes et d’images) dédié aux systèmes de recherche et de localisation d'images de documents. Nos descripteurs proposés comprennent SRIF, PSRIF, DELTRIF et SSKSRIF qui sont construits à partir de l’organisation spatiale des points d’intérêts les plus proches autour d'un point-clé pivot. Tous ces points sont extraits à partir des centres de gravité des composantes connexes de l‘image. A partir de ces points d’intérêts, des caractéristiques géométriques invariantes aux dégradations sont considérées pour construire nos descripteurs. SRIF et PSRIF sont calculés à partir d'un ensemble local des m points d’intérêts les plus proches autour d'un point d’intérêt pivot. Quant aux descripteurs DELTRIF et SSKSRIF, cette organisation spatiale est calculée via une triangulation de Delaunay formée à partir d'un ensemble de points d’intérêts extraits dans les images. Cette seconde version des descripteurs permet d’obtenir une description de forme locale sans paramètres. En outre, nous avons également étendu notre travail afin de le rendre compatible avec les descripteurs classiques de la littérature qui reposent sur l’utilisation de points d’intérêts dédiés de sorte qu'ils puissent traiter la recherche et la localisation d'images de documents à contenu hétérogène. La seconde contribution de cette thèse porte sur un système d'indexation de très grands volumes de données à partir d’un descripteur volumineux. Ces deux contraintes viennent peser lourd sur la mémoire du système d’indexation. En outre, la très grande dimensionnalité des descripteurs peut amener à une réduction de la précision de l'indexation, réduction liée au problème de dimensionnalité. Nous proposons donc trois techniques d'indexation robustes, qui peuvent toutes être employées sans avoir besoin de stocker les descripteurs locaux dans la mémoire du système. Cela permet, in fine, d’économiser la mémoire et d’accélérer le temps de recherche de l’information, tout en s’abstrayant d’une validation de type distance. Pour cela, nous avons proposé trois méthodes s’appuyant sur des arbres de décisions : « randomized clustering tree indexing” qui hérite des propriétés des kd-tree, « kmean-tree » et les « random forest » afin de sélectionner de manière aléatoire les K dimensions qui permettent de combiner la plus grande variance expliquée pour chaque nœud de l’arbre. Nous avons également proposé une fonction de hachage étendue pour l'indexation de contenus hétérogènes provenant de plusieurs couches de l'image. Comme troisième contribution de cette thèse, nous avons proposé une méthode simple et robuste pour calculer l'orientation des régions obtenues par le détecteur MSER, afin que celui-ci puisse être combiné avec des descripteurs dédiés. Comme la plupart de ces descripteurs visent à capturer des informations de voisinage autour d’une région donnée, nous avons proposé un moyen d'étendre les régions MSER en augmentant le rayon de chaque région. Cette stratégie peut également être appliquée à d'autres régions détectées afin de rendre les descripteurs plus distinctifs. 
Enfin, afin d'évaluer les performances de nos contributions, et en nous fondant sur l'absence d'ensemble de données publiquement disponibles pour la localisation d’information hétérogène dans des images capturées par une caméra, nous avons construit trois jeux de données qui sont disponibles pour la communauté scientifique
This work aims at developing a generic framework able to produce camera-based applications for information spotting in huge repositories of heterogeneous-content document images via local descriptors. The targeted systems take as input a portion of an image acquired as a query, and the system is capable of returning the focused portions of database images that best match the query. We first propose a set of generic feature descriptors for camera-based document image retrieval and spotting systems. Our proposed descriptors comprise SRIF, PSRIF, DELTRIF and SSKSRIF, which are built from the spatial organisation of the nearest keypoints around a pivot keypoint; the keypoints are extracted from the centroids of connected components. From these keypoints, invariant geometrical features are taken into account to build the descriptor. SRIF and PSRIF are computed from a local set of m nearest keypoints around a keypoint, while DELTRIF and SSKSRIF fix the way local shape description is combined, without parameters, via a Delaunay triangulation formed from a set of keypoints extracted from a document image. Furthermore, we propose a framework to compute the descriptors based on the spatial organisation of dedicated keypoints (e.g. SURF, SIFT or ORB) so that they can deal with heterogeneous-content camera-based document image retrieval and spotting. In practice, a large-scale indexing system with an enormous number of descriptors puts a heavy burden on memory when they are stored. In addition, the high dimensionality of descriptors can reduce indexing accuracy. We propose three robust indexing frameworks that can be employed without storing local descriptors in memory, saving memory and speeding up retrieval time by discarding distance validation. The randomized clustering tree indexing inherits from kd-trees, k-means trees and random forests the way K dimensions are selected randomly and combined with the highest-variance dimension at each node of the tree. We also propose a weighted Euclidean distance between two data points that is computed and oriented along the highest-variance dimension. The second proposed framework relies on hashing: an indexing system that employs one simple hash table for indexing and retrieving without storing database descriptors. Besides, we propose an extended hashing-based method for indexing multiple kinds of features coming from multiple layers of the image. Along with the proposed descriptors and indexing frameworks, we propose a simple, robust way to compute the shape orientation of MSER regions so that they can be combined with dedicated descriptors (e.g. SIFT, SURF, ORB) in a rotation-invariant way. In the case that descriptors are able to capture neighborhood information around MSER regions, we propose a way to extend MSER regions by increasing the radius of each region. This strategy can also be applied to other detected regions in order to make descriptors more distinctive. Moreover, we employ the extended hashing-based method for indexing multiple kinds of features from multiple layers of images. This system applies not only to a uniform feature type but also to multiple feature types from separate layers. Finally, in order to assess the performance of our contributions, and given that no public dataset exists for camera-based document image retrieval and spotting systems, we built a new dataset which has been made freely and publicly available to the scientific community. This dataset contains portions of document images acquired via a camera as queries.
It is composed of three kinds of information: textual content, graphical content and heterogeneous content.
APA, Harvard, Vancouver, ISO, and other styles
18

Wächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-64838.

Full text
Abstract:
Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing an ontology has been developed that contains 17,151 terms of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available on-line under www.Go3R.org.
APA, Harvard, Vancouver, ISO, and other styles
19

Vidal-Santos, Gerard. "Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos." Thesis, 2018. http://eprints.rclis.org/33589/1/VidalSantos_TFG_2018.pdf.

Full text
Abstract:
The aim of the study is to explore the use of unsupervised Named-Entity Recognition (NER) processes to generate descriptive metadata capable of assisting information retrieval interfaces in large-scale digital collections and of supporting the construction of more diverse knowledge representation models in academic libraries. For this purpose, the study reviews some experiences and canonical literature on automated subject heading creation in library and archive environments as a lever against the overexploited use of search engines as the main access points to retrieve assets in catalogs and digital collections, focusing on the guidelines established by two articles that address this task from two complementary points of view:
• van Hooland S, de Wilde M, Verborgh R, Steiner T, Van de Walle R. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 2013 Nov 1;30(2):262–79.
• Zeng M. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky's Theory and Two Research Samples. Knowledge Organization. 2014 Jan 1;440–51.
The first one provides the tools to generate named entities in large-scale samples of text and establishes the parameters to assess the suitability of these entities at a quantitative level. The second one provides the guidelines to analyze the quality of those results by developing a three-layered framework (identification-description-interpretation) based on Edward Panofsky's work on the analysis and interpretation of pictorial works. A work environment is built on this premise to extract and analyze the entities detected by DBpedia Spotlight (the NER service used for extraction) in a random collection of bibliographic records extracted from a thesis aggregator (Open Access Theses & Dissertations). The results show the great improvement in descriptive access points provided by these processes at a quantitative level, allowing users to browse more effectively in better contextualized records when the entities are combined with the keywords already indexed, despite the entities not having the necessary consistency to surpass the quality filter established in the evaluation table. This setback, however, only partially limits the possibility of improving the visibility of records in large collections by these means: if the logical constructions of the semantic base behind the extraction service are taken into consideration in iterative cataloging processes, they establish an iterative and cost-effective way of building more diverse knowledge graphs that connect manually or automatically indexed keywords to other nodes in the linked open data (LOD) cloud.
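For reference, the kind of unsupervised entity extraction evaluated in this study can be reproduced against the public DBpedia Spotlight web service; a minimal request is sketched below. The endpoint, parameters and response fields reflect the commonly documented public API, but they are assumptions here and should be verified before being relied upon.

    import requests

    def spotlight_annotate(text, confidence=0.5):
        # Ask DBpedia Spotlight to annotate the text and return the entity URIs it finds.
        resp = requests.get(
            "https://api.dbpedia-spotlight.org/en/annotate",
            params={"text": text, "confidence": confidence},
            headers={"Accept": "application/json"},
            timeout=30,
        )
        resp.raise_for_status()
        return [r["@URI"] for r in resp.json().get("Resources", [])]

    record = ("A study of information retrieval interfaces in academic libraries "
              "using the Open Access Theses and Dissertations aggregator.")
    print(spotlight_annotate(record))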
APA, Harvard, Vancouver, ISO, and other styles
20

Vidal-Santos, Gerard. "Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos." Thesis, 2018. http://eprints.rclis.org/33692/1/VidalSantos_TFG_2018.pdf.

Full text
Abstract:
The aim of the study is to explore the use of unsupervised Named-Entity Recognition (NER) processes to generate descriptive metadata capable of assisting information retrieval interfaces in large-scale digital collections and of supporting the construction of more diverse knowledge representation models in academic libraries. For this purpose, the study reviews some experiences and canonical literature on automated subject heading creation in library and archive environments as a lever against the overexploited use of search engines as the main access points to retrieve assets in catalogs and digital collections, focusing on the guidelines established by two articles that address this task from two complementary points of view:
• van Hooland S, de Wilde M, Verborgh R, Steiner T, Van de Walle R. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 2013 Nov 1;30(2):262–79.
• Zeng M. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky's Theory and Two Research Samples. Knowledge Organization. 2014 Jan 1;440–51.
The first one provides the tools to generate named entities in large-scale samples of text and establishes the parameters to assess the suitability of these entities at a quantitative level. The second one provides the guidelines to analyze the quality of those results by developing a three-layered framework (identification-description-interpretation) based on Edward Panofsky's work on the analysis and interpretation of pictorial works. A work environment is built on this premise to extract and analyze the entities detected by DBpedia Spotlight (the NER service used for extraction) in a random collection of bibliographic records extracted from a thesis aggregator (Open Access Theses & Dissertations). The results show the great improvement in descriptive access points provided by these processes at a quantitative level, allowing users to browse more effectively in better contextualized records when the entities are combined with the keywords already indexed, despite the entities not having the necessary consistency to surpass the quality filter established in the evaluation table. This setback, however, only partially limits the possibility of improving the visibility of records in large collections by these means: if the logical constructions of the semantic base behind the extraction service are taken into consideration in iterative cataloging processes, they establish an iterative and cost-effective way of building more diverse knowledge graphs that connect manually or automatically indexed keywords to other nodes in the linked open data (LOD) cloud.
APA, Harvard, Vancouver, ISO, and other styles
21

Çapkın, Çağdaş. "Türkçe metin tabanlı açık arşivlerde kullanılan dizinleme yönteminin değerlendirilmesi / Evaluation of indexing method used in Turkish text-based open archives." Thesis, 2011. http://eprints.rclis.org/28804/1/Cagdas_CAPKIN_Yuksek_Lisans_tezi.pdf.

Full text
Abstract:
The purpose of this research is to evaluate the performance of information retrieval systems designed for open archives, and of the standards/protocols that enable retrieving and organizing information in open archives. In this regard, an open archive was developed with 2215 text-based documents from the "Turkish Librarianship" journal, and three different information retrieval systems based on the Boolean and Vector Space models were designed in order to evaluate information retrieval performance in the developed open archive. The designed information retrieval systems are: a "metadata information retrieval system" (ÜBES) involving indexing with metadata created only by humans, a "full-text information retrieval system" (TBES) involving machine-only (automatic) indexing, and a "mixed information retrieval system" (KBES) involving indexing based on both human and machine. A descriptive research method is used to describe the current situation, and findings are evaluated against the literature. In order to evaluate the performance of the information retrieval systems, "precision and recall" and "normalized recall" measurements are made. The following results are found at the end of the research: it is determined that the precision performance of the KBES system designed for open archives creates a statistically significant difference in comparison to ÜBES and TBES. In each information retrieval system, a strong negative correlation is identified between recall and precision, where precision decreases as recall increases. It is determined that the "normalized recall" performance of ÜBES and KBES creates a statistically significant difference in comparison to TBES; in "normalized recall" performance, no statistically significant difference is identified between ÜBES and KBES. ÜBES is the system that retrieves the fewest relevant and nonrelevant documents; TBES retrieves the most nonrelevant and the second most relevant documents; and KBES retrieves the most relevant and the second most nonrelevant documents. It is concluded that using the OAI-PMH and OAI-ORE protocols together, rather than only the OAI-PMH protocol, fits the purpose of open archives.
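The evaluation measures named above are easy to state concretely; the sketch below computes precision, recall and a normalized recall for a single query ranking. The exact normalized recall formula used in the thesis is not given in the abstract, so the classic Salton definition is assumed here, and the documents are hypothetical.

    def precision_recall(retrieved, relevant):
        # Precision and recall of a retrieved set against the set of relevant documents.
        hits = len(set(retrieved) & set(relevant))
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def normalized_recall(ranking, relevant):
        # Salton's normalized recall: 1 - (sum of actual ranks - sum of ideal ranks)
        # / (n * (N - n)); assumes every relevant document appears in the ranking.
        n, N = len(relevant), len(ranking)
        actual = sum(i + 1 for i, doc in enumerate(ranking) if doc in relevant)
        ideal = n * (n + 1) / 2
        return 1.0 - (actual - ideal) / (n * (N - n))

    ranking = ["d3", "d7", "d1", "d9", "d2"]        # system output, best first
    relevant = {"d1", "d3"}                          # judged relevant documents
    print(precision_recall(ranking[:3], relevant))   # precision/recall at cut-off 3
    print(round(normalized_recall(ranking, relevant), 3))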
APA, Harvard, Vancouver, ISO, and other styles
22

Sifuentes, Raul. "Determinación de las causas de búsquedas sin resultados en un catálogo bibliográfico en línea mediante el análisis de bitácoras de transacciones : el caso de la Pontificia Universidad Católica del Perú." Thesis, 2013. http://eprints.rclis.org/23857/1/SIFUENTES_ARROYO_RAUL_CATALOGO_BIBLIOGRAFICO.pdf.

Full text
Abstract:
The present investigation is aimed at determining the causes of searches with zero results in the OPAC of the Sistema de Bibliotecas of the Pontificia Universidad Catolica del Peru during the year 2012. For this purpose, a transaction log analysis of the OPAC's searches was carried out as the methodology. Three causes of zero-result searches were found: 1) Term mismatch: words or phrases written correctly in search statements which do not match those used in the bibliographic description and subject terminology assigned to each bibliographic record. 2) Wrong writing in search statements due to typographical and spelling mistakes. 3) Wrong index selection: when a user selects the wrong index for his/her search statement. A more detailed analysis was done for each OPAC index. Suggestions are offered regarding reinforcing OPAC search training, improving the search engine interface, and using transaction log analysis of OPAC searches as a methodology to detect demand for library materials not present in the library's physical and virtual collections.
APA, Harvard, Vancouver, ISO, and other styles
23

Moreira, Walter. "Biblioteca tradicional x biblioteca virtual: modelos de recuperação da informação." Thesis, 1998. http://eprints.rclis.org/8353/1/BibliotecaTradicionalXBibliotecaVirtual_ModelosDeRecuperacaoDaInformacao.pdf.

Full text
Abstract:
Taking libraries as "technologies of intelligence", the author develops a reflection in which the traditional library is "opposed" to its virtual counterpart. Models of information retrieval are discussed within the two mentioned models of libraries. New virtual concepts such as "navigation" or "surfing" are compared with those present in more traditional information retrieval systems, such as "the best match". Other opposing concepts are also discussed: fuzzy set theory versus Boolean logic, and the interrelation between the foundations of information technology and the development of information retrieval tools is also under scrutiny. Finally, the author discusses the specificity of the hypermedia environment, which poses new questions for the development of Information Retrieval Theory. The radical differences perceived between traditional and virtual libraries do not lead the author to a polarization when considering the future. Although digitization is an irreversible tendency in the information world, the author concludes in favour of the complementarity of the two models of libraries.
APA, Harvard, Vancouver, ISO, and other styles
24

Yusan, Wang, and 王愚善. "Automatic Text Corpora Retrieval in Example-Based Machine Translation." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/67545771395405026129.

Full text
Abstract:
Master's degree
National Cheng Kung University
Institute of Information Management
89
Translation is often a matter of finding analogous examples in linguistic databanks and of discovering how a particular source-language text has been translated before. Example-based approaches to machine translation (MT) are generally viewed as alternatives to knowledge-based methods and as a supplement to traditional rule-based methods, which comprise three major steps: analysis, transfer and generation. Researchers have shown many advantages of the example-based approach to translation, one of which is allowing the user to select text corpora for a specific domain for the sake of matching efficiency and better translation quality in the target language. However, the selection of text corpora is generally accomplished by people who, if not translators, must be able to read the source language and to judge the category of the text. As a result, the monolingual user is unable to take advantage of this and has to spend much more time on text corpus matching in the bilingual (or multilingual) knowledge base. In this study, the proposed approach, namely Automatic Text Corpora Retrieval (ATCR), automates the process of identifying the corpora to which the source text is most closely related.
APA, Harvard, Vancouver, ISO, and other styles
25

"A concept-space based multi-document text summarizer." 2001. http://library.cuhk.edu.hk/record=b5890766.

Full text
Abstract:
by Tang Ting Kap.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 88-94).
Abstracts in English and Chinese.
List of Figures --- p.vi
List of Tables --- p.vii
Chapter 1. --- INTRODUCTION --- p.1
Chapter 1.1 --- Information Overloading and Low Utilization --- p.2
Chapter 1.2 --- Problem Needs To Solve --- p.3
Chapter 1.3 --- Research Contributions --- p.4
Chapter 1.3.1 --- Using Concept Space in Summarization --- p.5
Chapter 1.3.2 --- New Extraction Method --- p.5
Chapter 1.3.3 --- Experiments on New System --- p.6
Chapter 1.4 --- Organization of This Thesis --- p.7
Chapter 2. --- LITERATURE REVIEW --- p.8
Chapter 2.1 --- Classical Approach --- p.8
Chapter 2.1.1 --- Luhn's Algorithm --- p.9
Chapter 2.1.2 --- Edumundson's Algorithm --- p.11
Chapter 2.2 --- Statistical Approach --- p.15
Chapter 2.3 --- Natural Language Processing Approach --- p.15
Chapter 3. --- PROPOSED SUMMARIZATION APPROACH --- p.18
Chapter 3.1 --- Direction of Summarization --- p.19
Chapter 3.2 --- Overview of Summarization Algorithm --- p.20
Chapter 3.2.1 --- Document Pre-processing --- p.21
Chapter 3.2.2 --- Vector Space Model --- p.23
Chapter 3.2.3 --- Sentence Extraction --- p.24
Chapter 3.3 --- Evaluation Method --- p.25
Chapter 3.3.1 --- "Recall, Precision and F-measure" --- p.25
Chapter 3.4 --- Advantage of Concept Space Approach --- p.26
Chapter 4. --- SYSTEM ARCHITECTURE --- p.27
Chapter 4.1 --- Converge Process --- p.28
Chapter 4.2 --- Diverge Process --- p.30
Chapter 4.3 --- Backward Search --- p.31
Chapter 5. --- CONVERGE PROCESS --- p.32
Chapter 5.1 --- Document Merging --- p.32
Chapter 5.2 --- Word Phrase Extraction --- p.34
Chapter 5.3 --- Automatic Indexing --- p.34
Chapter 5.4 --- Cluster Analysis --- p.35
Chapter 5.5 --- Hopfield Net Classification --- p.37
Chapter 6. --- DIVERGE PROCESS --- p.42
Chapter 6.1 --- Concept Terms Refinement --- p.42
Chapter 6.2 --- Sentence Selection --- p.43
Chapter 6.3 --- Backward Searching --- p.46
Chapter 7. --- EXPERIMENT AND RESEARCH FINDINGS --- p.48
Chapter 7.1 --- System-generated Summary vs. Source Documents --- p.52
Chapter 7.1.1 --- Compression Ratio --- p.52
Chapter 7.1.2 --- Information Loss --- p.54
Chapter 7.2 --- System-generated Summary vs. Human-generated Summary --- p.58
Chapter 7.2.1 --- Background of EXTRACTOR --- p.59
Chapter 7.2.2 --- Evaluation Method --- p.61
Chapter 7.3 --- Evaluation of different System-generated Summaries by Human Experts --- p.63
Chapter 8. --- CONCLUSIONS AND FUTURE RESEARCH --- p.68
Chapter 8.1 --- Conclusions --- p.68
Chapter 8.2 --- Future Work --- p.69
Chapter A. --- EXTRACTOR SYSTEM FLOW AND TEN-STEP PROCEDURE --- p.71
Chapter B. --- SUMMARY GENERATED BY MS WORD2000 --- p.75
Chapter C. --- SUMMARY GENERATED BY EXTRACTOR SOFTWARE --- p.76
Chapter D. --- SUMMARY GENERATED BY OUR SYSTEM --- p.77
Chapter E. --- SYSTEM-GENERATED WORD PHRASES FROM TEST SAMPLE --- p.78
Chapter F. --- WORD PHRASES IDENTIFIED BY SUBJECTS --- p.79
Chapter G. --- SAMPLE OF QUESTIONNAIRE --- p.84
Chapter H. --- RESULT OF QUESTIONNAIRE --- p.85
Chapter I. --- EVALUATION FOR DIVERGE PROCESS --- p.86
BIBLIOGRAPHY --- p.88
APA, Harvard, Vancouver, ISO, and other styles
26

"A probabilistic approach for automatic text filtering." 1998. http://library.cuhk.edu.hk/record=b5889506.

Full text
Abstract:
Low Kon Fan.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 165-168).
Abstract also in Chinese.
Abstract --- p.i
Acknowledgment --- p.iv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Overview of Information Filtering --- p.1
Chapter 1.2 --- Contributions --- p.4
Chapter 1.3 --- Organization of this thesis --- p.6
Chapter 2 --- Existing Approaches --- p.7
Chapter 2.1 --- Representational issues --- p.7
Chapter 2.1.1 --- Document Representation --- p.7
Chapter 2.1.2 --- Feature Selection --- p.11
Chapter 2.2 --- Traditional Approaches --- p.15
Chapter 2.2.1 --- NewsWeeder --- p.15
Chapter 2.2.2 --- NewT --- p.17
Chapter 2.2.3 --- SIFT --- p.19
Chapter 2.2.4 --- InRoute --- p.20
Chapter 2.2.5 --- Motivation of Our Approach --- p.21
Chapter 2.3 --- Probabilistic Approaches --- p.23
Chapter 2.3.1 --- The Naive Bayesian Approach --- p.25
Chapter 2.3.2 --- The Bayesian Independence Classifier Approach --- p.28
Chapter 2.4 --- Comparison --- p.31
Chapter 3 --- Our Bayesian Network Approach --- p.33
Chapter 3.1 --- Backgrounds of Bayesian Networks --- p.34
Chapter 3.2 --- Bayesian Network Induction Approach --- p.36
Chapter 3.3 --- Automatic Construction of Bayesian Networks --- p.38
Chapter 4 --- Automatic Feature Discretization --- p.50
Chapter 4.1 --- Predefined Level Discretization --- p.52
Chapter 4.2 --- Lloyd's Algorithm --- p.53
Chapter 4.3 --- Class Dependence Discretization --- p.55
Chapter 5 --- Experiments and Results --- p.59
Chapter 5.1 --- Document Collections --- p.60
Chapter 5.2 --- Batch Filtering Experiments --- p.63
Chapter 5.3 --- Batch Filtering Results --- p.65
Chapter 5.4 --- Incremental Session Filtering Experiments --- p.87
Chapter 5.5 --- Incremental Session Filtering Results --- p.88
Chapter 6 --- Conclusions and Future Work --- p.105
Appendix A --- p.107
Appendix B --- p.116
Appendix C --- p.126
Appendix D --- p.131
Appendix E --- p.145
APA, Harvard, Vancouver, ISO, and other styles
27

"New learning strategies for automatic text categorization." 2001. http://library.cuhk.edu.hk/record=b5890838.

Full text
Abstract:
Lai Kwok-yin.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 125-130).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Automatic Textual Document Categorization --- p.1
Chapter 1.2 --- Meta-Learning Approach For Text Categorization --- p.3
Chapter 1.3 --- Contributions --- p.6
Chapter 1.4 --- Organization of the Thesis --- p.7
Chapter 2 --- Related Work --- p.9
Chapter 2.1 --- Existing Automatic Document Categorization Approaches --- p.9
Chapter 2.2 --- Existing Meta-Learning Approaches For Information Retrieval --- p.14
Chapter 2.3 --- Our Meta-Learning Approaches --- p.20
Chapter 3 --- Document Pre-Processing --- p.22
Chapter 3.1 --- Document Representation --- p.22
Chapter 3.2 --- Classification Scheme Learning Strategy --- p.25
Chapter 4 --- Linear Combination Approach --- p.30
Chapter 4.1 --- Overview --- p.30
Chapter 4.2 --- Linear Combination Approach - The Algorithm --- p.33
Chapter 4.2.1 --- Equal Weighting Strategy --- p.34
Chapter 4.2.2 --- Weighting Strategy Based On Utility Measure --- p.34
Chapter 4.2.3 --- Weighting Strategy Based On Document Rank --- p.35
Chapter 4.3 --- Comparisons of Linear Combination Approach and Existing Meta-Learning Methods --- p.36
Chapter 4.3.1 --- LC versus Simple Majority Voting --- p.36
Chapter 4.3.2 --- LC versus BORG --- p.38
Chapter 4.3.3 --- LC versus Restricted Linear Combination Method --- p.38
Chapter 5 --- The New Meta-Learning Model - MUDOF --- p.40
Chapter 5.1 --- Overview --- p.41
Chapter 5.2 --- Document Feature Characteristics --- p.42
Chapter 5.3 --- Classification Errors --- p.44
Chapter 5.4 --- Linear Regression Model --- p.45
Chapter 5.5 --- The MUDOF Algorithm --- p.47
Chapter 6 --- Incorporating MUDOF into Linear Combination approach --- p.52
Chapter 6.1 --- Background --- p.52
Chapter 6.2 --- Overview of MUDOF2 --- p.54
Chapter 6.3 --- Major Components of the MUDOF2 --- p.57
Chapter 6.4 --- The MUDOF2 Algorithm --- p.59
Chapter 7 --- Experimental Setup --- p.66
Chapter 7.1 --- Document Collection --- p.66
Chapter 7.2 --- Evaluation Metric --- p.68
Chapter 7.3 --- Component Classification Algorithms --- p.71
Chapter 7.4 --- Categorical Document Feature Characteristics for MUDOF and MUDOF2 --- p.72
Chapter 8 --- Experimental Results and Analysis --- p.74
Chapter 8.1 --- Performance of Linear Combination Approach --- p.74
Chapter 8.2 --- Performance of the MUDOF Approach --- p.78
Chapter 8.3 --- Performance of MUDOF2 Approach --- p.87
Chapter 9 --- Conclusions and Future Work --- p.96
Chapter 9.1 --- Conclusions --- p.96
Chapter 9.2 --- Future Work --- p.98
Chapter A --- Details of Experimental Results for Reuters-21578 corpus --- p.99
Chapter B --- Details of Experimental Results for OHSUMED corpus --- p.114
Bibliography --- p.125
APA, Harvard, Vancouver, ISO, and other styles
28

"Automatic text categorization for information filtering." 1998. http://library.cuhk.edu.hk/record=b5889734.

Full text
Abstract:
Ho Chao Yang.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 157-163).
Abstract also in Chinese.
Abstract --- p.i
Acknowledgment --- p.iii
List of Figures --- p.viii
List of Tables --- p.xiv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Automatic Document Categorization --- p.1
Chapter 1.2 --- Information Filtering --- p.3
Chapter 1.3 --- Contributions --- p.6
Chapter 1.4 --- Organization of the Thesis --- p.7
Chapter 2 --- Related Work --- p.9
Chapter 2.1 --- Existing Automatic Document Categorization Approaches --- p.9
Chapter 2.1.1 --- Rule-Based Approach --- p.10
Chapter 2.1.2 --- Similarity-Based Approach --- p.13
Chapter 2.2 --- Existing Information Filtering Approaches --- p.19
Chapter 2.2.1 --- Information Filtering Systems --- p.19
Chapter 2.2.2 --- Filtering in TREC --- p.21
Chapter 3 --- Document Pre-Processing --- p.23
Chapter 3.1 --- Document Representation --- p.23
Chapter 3.2 --- Classification Scheme Learning Strategy --- p.26
Chapter 4 --- A New Approach - IBRI --- p.31
Chapter 4.1 --- Overview of Our New IBRI Approach --- p.31
Chapter 4.2 --- The IBRI Representation and Definitions --- p.34
Chapter 4.3 --- The IBRI Learning Algorithm --- p.37
Chapter 5 --- IBRI Experiments --- p.43
Chapter 5.1 --- Experimental Setup --- p.43
Chapter 5.2 --- Evaluation Metric --- p.45
Chapter 5.3 --- Results --- p.46
Chapter 6 --- A New Approach - GIS --- p.50
Chapter 6.1 --- Motivation of GIS --- p.50
Chapter 6.2 --- Similarity-Based Learning --- p.51
Chapter 6.3 --- The Generalized Instance Set Algorithm (GIS) --- p.58
Chapter 6.4 --- Using GIS Classifiers for Classification --- p.63
Chapter 6.5 --- Time Complexity --- p.64
Chapter 7 --- GIS Experiments --- p.68
Chapter 7.1 --- Experimental Setup --- p.68
Chapter 7.2 --- Results --- p.73
Chapter 8 --- A New Information Filtering Approach Based on GIS --- p.87
Chapter 8.1 --- Information Filtering Systems --- p.87
Chapter 8.2 --- GIS-Based Information Filtering --- p.90
Chapter 9 --- Experiments on GIS-based Information Filtering --- p.95
Chapter 9.1 --- Experimental Setup --- p.95
Chapter 9.2 --- Results --- p.100
Chapter 10 --- Conclusions and Future Work --- p.108
Chapter 10.1 --- Conclusions --- p.108
Chapter 10.2 --- Future Work --- p.110
Chapter A --- Sample Documents in the corpora --- p.111
Chapter B --- Details of Experimental Results of GIS --- p.120
Chapter C --- Computational Time of Reuters-21578 Experiments --- p.141
APA, Harvard, Vancouver, ISO, and other styles
29

Kuan-Ming Chou and 周冠銘. "Using automatic keywords extraction and text clustering methods for medical information retrieval improvement." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/80362319360586009723.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Institute of Medical Informatics
101
Because of the huge volume of data on the web, searches return many duplicate and near-duplicate results. The motivation of this thesis is to reduce the time users spend filtering out this duplicate and near-duplicate information when they search. We propose a novel clustering method to address the near-duplicate problem. The method transforms each document into a feature vector whose weights are the term frequencies of the corresponding words. To reduce the dimensionality of these feature vectors, we apply principal component analysis (PCA) to project them into another space. After PCA, we compute the cosine similarity between each pair of documents, and then use the EM algorithm together with a Neyman-Pearson hypothesis test to cluster the duplicate documents. We compared our results with those of the K-means method, and the experiments show that our method outperforms K-means.
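A minimal sketch of the document-representation pipeline described above (term-frequency vectors, PCA dimensionality reduction, pairwise cosine similarity); the EM and Neyman-Pearson clustering step is omitted, and the example documents are invented:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "aspirin reduces fever in adult patients",
    "aspirin reduces fever in most adult patients",  # near-duplicate of the first
    "MRI imaging of the knee joint",
]

# Term-frequency feature vectors
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Project the vectors into a lower-dimensional space with PCA
reduced = PCA(n_components=2).fit_transform(tf)

# Pairwise cosine similarity between documents in the reduced space
sim = cosine_similarity(reduced)
print(np.round(sim, 2))  # the first two documents should be far more similar to each other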
APA, Harvard, Vancouver, ISO, and other styles
30

"Automatic index generation for the free-text based database." Chinese University of Hong Kong, 1992. http://library.cuhk.edu.hk/record=b5887040.

Full text
Abstract:
by Leung Chi Hong.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1992.
Includes bibliographical references (leaves 183-184).
Chapter Chapter one: --- Introduction --- p.1
Chapter Chapter two: --- Background knowledge and linguistic approaches of automatic indexing --- p.5
Chapter 2.1 --- Definition of index and indexing --- p.5
Chapter 2.2 --- Indexing methods and problems --- p.7
Chapter 2.3 --- Automatic indexing and human indexing --- p.8
Chapter 2.4 --- Different approaches of automatic indexing --- p.10
Chapter 2.5 --- Example of semantic approach --- p.11
Chapter 2.6 --- Example of syntactic approach --- p.14
Chapter 2.7 --- Comments on semantic and syntactic approaches --- p.18
Chapter Chapter three: --- Rationale and methodology of automatic index generation --- p.19
Chapter 3.1 --- Problems caused by natural language --- p.19
Chapter 3.2 --- Usage of word frequencies --- p.20
Chapter 3.3 --- Brief description of rationale --- p.24
Chapter 3.4 --- Automatic index generation --- p.27
Chapter 3.4.1 --- Training phase --- p.27
Chapter 3.4.1.1 --- Selection of training documents --- p.28
Chapter 3.4.1.2 --- Control and standardization of variants of words --- p.28
Chapter 3.4.1.3 --- Calculation of associations between words and indexes --- p.30
Chapter 3.4.1.4 --- Discarding false associations --- p.33
Chapter 3.4.2 --- Indexing phase --- p.38
Chapter 3.4.3 --- Example of automatic indexing --- p.41
Chapter 3.5 --- Related researches --- p.44
Chapter 3.6 --- Word diversity and its effect on automatic indexing --- p.46
Chapter 3.7 --- Factors affecting performance of automatic indexing --- p.60
Chapter 3.8 --- Application of semantic representation --- p.61
Chapter 3.8.1 --- Problem of natural language --- p.61
Chapter 3.8.2 --- Use of concept headings --- p.62
Chapter 3.8.3 --- Example of using concept headings in automatic indexing --- p.65
Chapter 3.8.4 --- Advantages of concept headings --- p.68
Chapter 3.8.5 --- Disadvantages of concept headings --- p.69
Chapter 3.9 --- Correctness prediction for proposed indexes --- p.78
Chapter 3.9.1 --- Example of using index proposing rate --- p.80
Chapter 3.10 --- Effect of subject matter on automatic indexing --- p.83
Chapter 3.11 --- Comparison with other indexing methods --- p.85
Chapter 3.12 --- Proposal for applying Chinese medical knowledge --- p.90
Chapter Chapter four: --- Simulations of automatic index generation --- p.93
Chapter 4.1 --- Training phase simulations --- p.93
Chapter 4.1.1 --- Simulation of association calculation (word diversity uncontrolled) --- p.94
Chapter 4.1.2 --- Simulation of association calculation (word diversity controlled) --- p.102
Chapter 4.1.3 --- Simulation of discarding false associations --- p.107
Chapter 4.2 --- Indexing phase simulation --- p.115
Chapter 4.3 --- Simulation of using concept headings --- p.120
Chapter 4.4 --- Simulation for testing performance of predicting index correctness --- p.125
Chapter 4.5 --- Summary --- p.128
Chapter Chapter five: --- Real case study in database of Chinese Medicinal Material Research Center --- p.130
Chapter 5.1 --- Selection of real documents --- p.130
Chapter 5.2 --- Case study one: Overall performance using real data --- p.132
Chapter 5.2.1 --- Sample results of automatic indexing for real documents --- p.138
Chapter 5.3 --- Case study two: Using multi-word terms --- p.148
Chapter 5.4 --- Case study three: Using concept headings --- p.152
Chapter 5.5 --- Case study four: Prediction of proposed index correctness --- p.156
Chapter 5.6 --- Case study five: Use of (Σ ΔRij) Fi to determine false association --- p.159
Chapter 5.7 --- Case study six: Effect of word diversity --- p.162
Chapter 5.8 --- Summary --- p.166
Chapter Chapter six: --- Conclusion --- p.168
Appendix A: List of stopwords --- p.173
Appendix B: Index terms used in case studies --- p.174
References --- p.183
APA, Harvard, Vancouver, ISO, and other styles
31

(9761117), Shayan Ali A. Akbar. "Source code search for automatic bug localization." Thesis, 2020.

Find full text
Abstract:
This dissertation advances the state-of-the-art in information retrieval (IR) based automatic bug localization for large software systems. We present techniques from three generations of IR based bug localization and compare their performances on our large and diverse bug localization dataset --- the Bugzbook dataset. The three generations span over fifteen years of research in mining software repositories for bug localization and include: (1) the generation of simple bag-of-words (BoW) based techniques, (2) the generation in which software-centric information such as bug and code change histories as well as structured information embedded in bug reports and code files are exploited to improve retrieval, and (3) the third and most recent generation in which order and semantic relationships between terms are modeled to improve the performance of bug localization systems. The dissertation also presents a novel technique called SCOR (Source Code Retrieval with Semantics and Order) which combines Markov Random Fields (MRF) based term-term ordering dependencies with semantic word vectors obtained from neural network based word embedding algorithms, such as word2vec, to better localize bugs in code files. The results presented in this dissertation show that while term-term ordering and semantic relationships significantly improve the performance when they are modeled separately in retrieval systems, the best precisions in retrieval are obtained when they are modeled together in a single retrieval system. We also show that the semantic representations of software terms learned by training the word embedding algorithm on a corpus of software repositories can be used to perform search in new software code repositories not present in the training corpus of the word embedding algorithm.
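SCOR itself combines Markov Random Field ordering dependencies with word2vec embeddings; the sketch below only illustrates the underlying idea of blending order-aware and semantic evidence in a single retrieval score, using invented toy vectors and a much simpler bigram-order measure than MRF:

import numpy as np

# Toy word vectors standing in for word2vec embeddings (illustrative values only)
vectors = {
    "null": np.array([0.9, 0.1]), "pointer": np.array([0.8, 0.2]),
    "reference": np.array([0.7, 0.3]), "exception": np.array([0.2, 0.9]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_score(query, document):
    # Average best-match embedding similarity between query terms and document terms.
    return sum(max(cos(vectors[q], vectors[d]) for d in document) for q in query) / len(query)

def ordered_bigram_score(query, document):
    # Fraction of adjacent query term pairs that also occur adjacently in the document.
    doc_pairs = set(zip(document, document[1:]))
    pairs = list(zip(query, query[1:]))
    return sum(p in doc_pairs for p in pairs) / max(len(pairs), 1)

query = ["null", "pointer", "exception"]
document = ["null", "reference", "exception", "pointer"]
score = 0.5 * semantic_score(query, document) + 0.5 * ordered_bigram_score(query, document)
print(round(score, 3))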
APA, Harvard, Vancouver, ISO, and other styles
32

Khoo, Christopher S. G. "Automatic identification of causal relations in text and their use for improving precision in information retrieval." Thesis, 1995. http://hdl.handle.net/10150/105106.

Full text
Abstract:
Parts of the thesis were published in: 1. Khoo, C., Myaeng, S.H., & Oddy, R. (2001). Using cause-effect relations in text to improve information retrieval precision. Information Processing and Management, 37(1), 119-145. 2. Khoo, C., Kornfilt, J., Oddy, R., & Myaeng, S.H. (1998). Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary & Linguistic Computing, 13(4), 177-186. 3. Khoo, C. (1997). The use of relation matching in information retrieval. LIBRES: Library and Information Science Research Electronic Journal [Online], 7(2). Available at: http://aztec.lib.utk.edu/libres/libre7n2/. An update of the literature review on causal relations in text was published in: Khoo, C., Chan, S., & Niu, Y. (2002). The many facets of the cause-effect relation. In R.Green, C.A. Bean & S.H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective (pp. 51-70). Dordrecht: Kluwer
This study represents one attempt to make use of relations expressed in text to improve information retrieval effectiveness. In particular, the study investigated whether the information obtained by matching causal relations expressed in documents with the causal relations expressed in users' queries could be used to improve document retrieval results in comparison to using just term matching without considering relations. An automatic method for identifying and extracting cause-effect information in Wall Street Journal text was developed. The method uses linguistic clues to identify causal relations without recourse to knowledge-based inferencing. The method was successful in identifying and extracting about 68% of the causal relations that were clearly expressed within a sentence or between adjacent sentences in Wall Street Journal text. Of the instances that the computer program identified as causal relations, 72% can be considered to be correct. The automatic method was used in an experimental information retrieval system to identify causal relations in a database of full-text Wall Street Journal documents. Causal relation matching was found to yield a small but significant improvement in retrieval results when the weights used for combining the scores from different types of matching were customized for each query -- as in an SDI or routing queries situation. The best results were obtained when causal relation matching was combined with word proximity matching (matching pairs of causally related words in the query with pairs of words that co-occur within document sentences). An analysis using manually identified causal relations indicates that bigger retrieval improvements can be expected with more accurate identification of causal relations. The best kind of causal relation matching was found to be one in which one member of the causal relation (either the cause or the effect) was represented as a wildcard that could match with any term. The study also investigated whether using Roget's International Thesaurus (3rd ed.) to expand query terms with synonymous and related terms would improve retrieval effectiveness. Using Roget category codes in addition to keywords did give better retrieval results. However, the Roget codes were better at identifying the non-relevant documents than the relevant ones.
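The thesis identifies causal relations through linguistic clues rather than knowledge-based inferencing; the fragment below is only a crude illustration of that style of clue-based extraction (the two patterns are invented examples, far simpler than the clue inventory actually used):

import re

# A couple of causal cue patterns; purely illustrative
PATTERNS = [
    re.compile(r"(?P<cause>[^.]+?)\s+(?:caused|led to|resulted in)\s+(?P<effect>[^.]+)", re.I),
    re.compile(r"(?P<effect>[^.]+?)\s+(?:because of|due to)\s+(?P<cause>[^.]+)", re.I),
]

def extract_causal(sentence):
    # Return a (cause, effect) pair if a cue pattern matches, else None.
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group("cause").strip(), match.group("effect").strip()
    return None

print(extract_causal("Rising oil prices led to higher airfares."))
print(extract_causal("The flight was delayed because of heavy fog."))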
APA, Harvard, Vancouver, ISO, and other styles
33

Williams, Kyle. "Learning to Read Bushman: Automatic Handwriting Recognition for Bushman Languages." Thesis, 2012. http://pubs.cs.uct.ac.za/archive/00000791/.

Full text
Abstract:
The Bleek and Lloyd Collection contains notebooks that document the tradition, language and culture of the Bushman people who lived in South Africa in the late 19th century. Transcriptions of these notebooks would allow for the provision of services such as text-based search and text-to-speech. However, these notebooks are currently only available in the form of digital scans and the manual creation of transcriptions is a costly and time-consuming process. Thus, automatic methods could serve as an alternative approach to creating transcriptions of the text in the notebooks. In order to evaluate the use of automatic methods, a corpus of Bushman texts and their associated transcriptions was created. The creation of this corpus involved: the development of a custom method for encoding the Bushman script, which contains complex diacritics; the creation of a tool for creating and transcribing the texts in the notebooks; and the running of a series of workshops in which the tool was used to create the corpus. The corpus was used to evaluate the use of various techniques for automatically transcribing the texts in the corpus in order to determine which approaches were best suited to the complex Bushman script. These techniques included the use of Support Vector Machines, Artificial Neural Networks and Hidden Markov Models as machine learning algorithms, which were coupled with different descriptive features. The effect of the texts used for training the machine learning algorithms was also investigated as well as the use of a statistical language model. It was found that, for Bushman word recognition, the use of a Support Vector Machine with Histograms of Oriented Gradient features resulted in the best performance and, for Bushman text line recognition, Marti & Bunke features resulted in the best performance when used with Hidden Markov Models. The automatic transcription of the Bushman texts proved to be difficult and the performance of the different recognition systems was largely affected by the complexities of the Bushman script. It was also found that, besides having an influence on determining which techniques may be the most appropriate for automatic handwriting recognition, the texts used in an automatic handwriting recognition system also play a large role in determining whether or not automatic recognition should be attempted at all.
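As a hedged illustration of the best-performing word recognition configuration reported above (a Support Vector Machine over Histogram of Oriented Gradients features), the sketch below trains an SVM on synthetic 32x32 "glyphs" rather than on Bushman script; it shows the shape of the pipeline only, not the thesis implementation:

import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def synthetic_glyph(kind):
    # Toy 32x32 image: a vertical (kind 0) or horizontal (kind 1) stroke plus noise.
    img = rng.normal(0.0, 0.05, (32, 32))
    if kind == 0:
        img[:, 14:18] += 1.0
    else:
        img[14:18, :] += 1.0
    return img

def features(img):
    # Histogram of Oriented Gradients descriptor for one image.
    return hog(img, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

labels = [0, 1] * 20
train = [features(synthetic_glyph(y)) for y in labels]

clf = SVC(kernel="linear").fit(train, labels)
print(clf.predict([features(synthetic_glyph(0))]))  # expected: [0]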
APA, Harvard, Vancouver, ISO, and other styles
34

"Automatic construction of wrappers for semi-structured documents." 2001. http://library.cuhk.edu.hk/record=b5890663.

Full text
Abstract:
Lin Wai-yip.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 114-123).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Information Extraction --- p.1
Chapter 1.2 --- IE from Semi-structured Documents --- p.3
Chapter 1.3 --- Thesis Contributions --- p.7
Chapter 1.4 --- Thesis Organization --- p.9
Chapter 2 --- Related Work --- p.11
Chapter 2.1 --- Existing Approaches --- p.11
Chapter 2.2 --- Limitations of Existing Approaches --- p.18
Chapter 2.3 --- Our HISER Approach --- p.20
Chapter 3 --- System Overview --- p.23
Chapter 3.1 --- Hierarchical record Structure and Extraction Rule learning (HISER) --- p.23
Chapter 3.2 --- Hierarchical Record Structure --- p.29
Chapter 3.3 --- Extraction Rule --- p.29
Chapter 3.4 --- Wrapper Adaptation --- p.32
Chapter 4 --- Automatic Hierarchical Record Structure Construction --- p.34
Chapter 4.1 --- Motivation --- p.34
Chapter 4.2 --- Hierarchical Record Structure Representation --- p.36
Chapter 4.3 --- Constructing Hierarchical Record Structure --- p.38
Chapter 5 --- Extraction Rule Induction --- p.43
Chapter 5.1 --- Rule Representation --- p.43
Chapter 5.2 --- Extraction Rule Induction Algorithm --- p.47
Chapter 6 --- Experimental Results of Wrapper Learning --- p.54
Chapter 6.1 --- Experimental Methodology --- p.54
Chapter 6.2 --- Results on Electronic Appliance Catalogs --- p.56
Chapter 6.3 --- Results on Book Catalogs --- p.60
Chapter 6.4 --- Results on Seminar Announcements --- p.62
Chapter 7 --- Adapting Wrappers to Unseen Information Sources --- p.69
Chapter 7.1 --- Motivation --- p.69
Chapter 7.2 --- Support Vector Machines --- p.72
Chapter 7.3 --- Feature Selection --- p.76
Chapter 7.4 --- Automatic Annotation of Training Examples --- p.80
Chapter 7.4.1 --- Building SVM Models --- p.81
Chapter 7.4.2 --- Seeking Potential Training Example Candidates --- p.82
Chapter 7.4.3 --- Classifying Potential Training Examples --- p.84
Chapter 8 --- Experimental Results of Wrapper Adaptation --- p.86
Chapter 8.1 --- Experimental Methodology --- p.86
Chapter 8.2 --- Results on Electronic Appliance Catalogs --- p.89
Chapter 8.3 --- Results on Book Catalogs --- p.93
Chapter 9 --- Conclusions and Future Work --- p.97
Chapter 9.1 --- Conclusions --- p.97
Chapter 9.2 --- Future Work --- p.100
Chapter A --- Sample Experimental Pages --- p.101
Chapter B --- Detailed Experimental Results of Wrapper Adaptation of HISER --- p.109
Bibliography --- p.114
APA, Harvard, Vancouver, ISO, and other styles
35

"Automatic construction and adaptation of wrappers for semi-structured web documents." 2003. http://library.cuhk.edu.hk/record=b5891460.

Full text
Abstract:
Wong Tak Lam.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 88-94).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Wrapper Induction for Semi-structured Web Documents --- p.1
Chapter 1.2 --- Adapting Wrappers to Unseen Web Sites --- p.6
Chapter 1.3 --- Thesis Contributions --- p.7
Chapter 1.4 --- Thesis Organization --- p.8
Chapter 2 --- Related Work --- p.10
Chapter 2.1 --- Related Work on Wrapper Induction --- p.10
Chapter 2.2 --- Related Work on Wrapper Adaptation --- p.16
Chapter 3 --- Automatic Construction of Hierarchical Wrappers --- p.20
Chapter 3.1 --- Hierarchical Record Structure Inference --- p.22
Chapter 3.2 --- Extraction Rule Induction --- p.30
Chapter 3.3 --- Applying Hierarchical Wrappers --- p.38
Chapter 4 --- Experimental Results for Wrapper Induction --- p.40
Chapter 5 --- Adaptation of Wrappers for Unseen Web Sites --- p.52
Chapter 5.1 --- Problem Definition --- p.52
Chapter 5.2 --- Overview of Wrapper Adaptation Framework --- p.55
Chapter 5.3 --- Potential Training Example Candidate Identification --- p.58
Chapter 5.3.1 --- Useful Text Fragments --- p.58
Chapter 5.3.2 --- Training Example Generation from the Unseen Web Site --- p.60
Chapter 5.3.3 --- Modified Nearest Neighbour Classification --- p.63
Chapter 5.4 --- Machine Annotated Training Example Discovery and New Wrapper Learning --- p.64
Chapter 5.4.1 --- Text Fragment Classification --- p.64
Chapter 5.4.2 --- New Wrapper Learning --- p.69
Chapter 6 --- Case Study and Experimental Results for Wrapper Adaptation --- p.71
Chapter 6.1 --- Case Study on Wrapper Adaptation --- p.71
Chapter 6.2 --- Experimental Results --- p.73
Chapter 6.2.1 --- Book Domain --- p.74
Chapter 6.2.2 --- Consumer Electronic Appliance Domain --- p.79
Chapter 7 --- Conclusions and Future Work --- p.83
Bibliography --- p.88
Chapter A --- Detailed Performance of Wrapper Induction for Book Domain --- p.95
Chapter B --- Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain --- p.99
APA, Harvard, Vancouver, ISO, and other styles
36

"Statistical modeling for lexical chains for automatic Chinese news story segmentation." 2010. http://library.cuhk.edu.hk/record=b5894500.

Full text
Abstract:
Chan, Shing Kai.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2010.
Includes bibliographical references (leaves 106-114).
Abstracts in English and Chinese.
Abstract --- p.i
Acknowledgements --- p.v
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Problem Statement --- p.2
Chapter 1.2 --- Motivation for Story Segmentation --- p.4
Chapter 1.3 --- Terminologies --- p.5
Chapter 1.4 --- Thesis Goals --- p.6
Chapter 1.5 --- Thesis Organization --- p.8
Chapter 2 --- Background Study --- p.9
Chapter 2.1 --- Coherence-based Approaches --- p.10
Chapter 2.1.1 --- Defining Coherence --- p.10
Chapter 2.1.2 --- Lexical Chaining --- p.12
Chapter 2.1.3 --- Cosine Similarity --- p.15
Chapter 2.1.4 --- Language Modeling --- p.19
Chapter 2.2 --- Feature-based Approaches --- p.21
Chapter 2.2.1 --- Lexical Cues --- p.22
Chapter 2.2.2 --- Audio Cues --- p.23
Chapter 2.2.3 --- Video Cues --- p.24
Chapter 2.3 --- Pros and Cons and Hybrid Approaches --- p.25
Chapter 2.4 --- Chapter Summary --- p.27
Chapter 3 --- Experimental Corpora --- p.29
Chapter 3.1 --- The TDT2 and TDT3 Multi-language Text Corpus --- p.29
Chapter 3.1.1 --- Introduction --- p.29
Chapter 3.1.2 --- Program Particulars and Structures --- p.31
Chapter 3.2 --- Data Preprocessing --- p.33
Chapter 3.2.1 --- Challenges of Lexical Chain Formation on Chinese Text --- p.33
Chapter 3.2.2 --- Word Segmentation for Word Units Extraction --- p.35
Chapter 3.2.3 --- Part-of-speech Tagging for Candidate Words Extraction --- p.36
Chapter 3.3 --- Chapter Summary --- p.37
Chapter 4 --- Indication of Lexical Cohesiveness by Lexical Chains --- p.39
Chapter 4.1 --- Lexical Chain as a Representation of Cohesiveness --- p.40
Chapter 4.1.1 --- Choice of Word Relations for Lexical Chaining --- p.41
Chapter 4.1.2 --- Lexical Chaining by Connecting Repeated Lexical Elements --- p.43
Chapter 4.2 --- Lexical Chain as an Indicator of Story Segments --- p.48
Chapter 4.2.1 --- Indicators of Absence of Cohesiveness --- p.49
Chapter 4.2.2 --- Indicator of Continuation of Cohesiveness --- p.58
Chapter 4.3 --- Chapter Summary --- p.62
Chapter 5 --- Indication of Story Boundaries by Lexical Chains --- p.63
Chapter 5.1 --- Formal Definition of the Classification Procedures --- p.64
Chapter 5.2 --- Theoretical Framework for Segmentation Based on Lexical Chaining --- p.65
Chapter 5.2.1 --- Evaluation of Story Segmentation Accuracy --- p.65
Chapter 5.2.2 --- Previous Approach of Story Segmentation Based on Lexical Chaining --- p.66
Chapter 5.2.3 --- Statistical Framework for Story Segmentation based on Lexical Chaining --- p.69
Chapter 5.2.4 --- Post Processing of Ratio for Boundary Identification --- p.73
Chapter 5.3 --- Comparing Segmentation Models --- p.75
Chapter 5.4 --- Chapter Summary --- p.79
Chapter 6 --- Analysis of Lexical Chains Features as Boundary Indicators --- p.80
Chapter 6.1 --- Error Analysis --- p.81
Chapter 6.2 --- Window Length in the LRT Model --- p.82
Chapter 6.3 --- The Relative Importance of Each Set of Features --- p.84
Chapter 6.4 --- The Effect of Removing Timing Information --- p.92
Chapter 6.5 --- Chapter Summary --- p.96
Chapter 7 --- Conclusions and Future Work --- p.98
Chapter 7.1 --- Contributions --- p.98
Chapter 7.2 --- Future Works --- p.100
Chapter 7.2.1 --- Further Extension of the Framework --- p.100
Chapter 7.2.2 --- Wider Applications of the Framework --- p.105
Bibliography --- p.106
APA, Harvard, Vancouver, ISO, and other styles
37

Cerdeirinha, João Manuel Macedo. "Recuperação de imagens digitais com base no conteúdo: estudo na Biblioteca de Arte e Arquivos da Fundação Calouste Gulbenkian." Master's thesis, 2019. http://hdl.handle.net/10362/91474.

Full text
Abstract:
Abstract also in Portuguese.
The massive growth of multimedia data on the Internet and the emergence of new sharing platforms created major challenges for information retrieval. The limitations of text-based searches for this type of content have led to the development of a content-based information retrieval approach that has received increasing attention in recent decades. Taking into account the research carried out in this area, and digital images being the focus of this research, concepts and techniques associated with this approach are explored through a theoretical survey that reports the evolution of information retrieval and the importance that this subject has for Information Management and Curation. In the context of the systems that have been developed using automatic indexing, the various applications of this type of process are indicated. Available CBIR tools are also identified for a case study of the application of this type of image retrieval in the context of the Art Library and Archives of the Calouste Gulbenkian Foundation and the photographic collections that it holds in its resources, considering the particularities of the institution to which they belong. For the intended demonstration and according to the established criteria, online CBIR tools were initially used and, in the following phase, locally installed software was selected to search and retrieve in a specific collection. Through this case study, the strengths and weaknesses of content-based image retrieval are attested against the more traditional approach based on textual metadata currently in use in these collections. Taking into consideration the needs of users of the systems in which these digital objects are indexed, combining these techniques may lead to more satisfactory results.
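The content-based approach discussed above can be made concrete with a small sketch. The following fragment (not related to the CBIR tools actually evaluated in the study; images and values are synthetic) ranks a tiny image collection against a query image using colour histograms and histogram intersection:

import numpy as np

def colour_histogram(image, bins=8):
    # Quantize each RGB channel into `bins` levels and build a normalized 3D histogram.
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins, bins, bins), range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / hist.sum()

def similarity(h1, h2):
    # Histogram intersection: 1.0 means identical colour distributions.
    return float(np.minimum(h1, h2).sum())

rng = np.random.default_rng(1)
query = rng.integers(0, 256, (64, 64, 3))
collection = {"photo_a": query.copy(), "photo_b": rng.integers(0, 256, (64, 64, 3))}

q_hist = colour_histogram(query)
ranked = sorted(collection, key=lambda k: -similarity(q_hist, colour_histogram(collection[k])))
print(ranked)  # "photo_a" (identical to the query) should rank first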
APA, Harvard, Vancouver, ISO, and other styles
38

Wächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, 2010. https://tud.qucosa.de/id/qucosa%3A25496.

Full text
Abstract:
Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing an ontology has been developed that contains 17,151 terms of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available on-line under www.Go3R.org.
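DOG4DAG's term generation step identifies statistically significant noun phrases in text; the fragment below is only a rough illustration of that idea, ranking single terms by a naive log-ratio of domain frequency to background frequency (example texts are invented, and the scoring is far simpler than the actual method):

from collections import Counter
import math
import re

def terms(text):
    return re.findall(r"[a-z]+", text.lower())

def significant_terms(domain_text, background_text, top=5):
    # Rank terms by log(domain relative frequency / smoothed background relative frequency).
    domain = Counter(terms(domain_text))
    background = Counter(terms(background_text))
    total_d, total_b = sum(domain.values()), sum(background.values())
    scores = {
        t: math.log((domain[t] / total_d) / ((background[t] + 1) / (total_b + 1)))
        for t in domain
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]

domain = "embryonic stem cell culture and stem cell differentiation assays"
background = "the report was published and the results were discussed in the meeting"
print(significant_terms(domain, background))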
APA, Harvard, Vancouver, ISO, and other styles

To the bibliography