Dissertations / Theses on the topic 'Automatic text retrieval'

To see the other types of publications on this topic, follow the link: Automatic text retrieval.

Create an accurate reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 38 dissertations / theses for your research on the topic 'Automatic text retrieval.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Viana, Hugo Henrique Amorim. "Automatic information retrieval through text-mining." Master's thesis, Faculdade de Ciências e Tecnologia, 2013. http://hdl.handle.net/10362/11308.

Full text
Abstract:
The dissertation presented for obtaining the Master’s Degree in Electrical Engineering and Computer Science, at Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia
Nowadays a huge number of firms in the European Union are catalogued as Small and Medium Enterprises (SMEs), and together they employ a great portion of Europe's active workforce. Nonetheless, SMEs cannot afford to implement methods or tools to systematically adopt innovation as part of their business process. Innovation is the engine of competitiveness in a globalized environment, especially in the current socio-economic situation. This thesis provides a platform that, when integrated with the ExtremeFactories (EF) project, helps SMEs become more competitive by means of a scheduled monitoring functionality. The thesis presents a text-mining platform able to schedule the gathering of information through keywords. Several implementation choices had to be made in developing the platform; the one that deserves particular emphasis is the framework, Apache Lucene Core, which supplies an efficient text-mining engine and is used extensively for the purposes of this thesis.
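A rough sketch of the scheduled, keyword-driven gathering described above is given below (a minimal Python illustration, not the thesis's Lucene-based implementation; fetch_documents and all other names are hypothetical placeholders):

```python
import time

def build_index(documents):
    """Very small inverted index: token -> set of document ids."""
    index = {}
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index

def match_keywords(index, keywords):
    """Return ids of documents containing all the given keywords."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*hits) if hits else set()

def monitor(fetch_documents, keywords, interval_seconds=3600):
    """Periodically fetch new documents and report those matching the keywords."""
    while True:
        docs = fetch_documents()  # user-supplied callable returning {id: text}
        print(match_keywords(build_index(docs), keywords))
        time.sleep(interval_seconds)
```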
APA, Harvard, Vancouver, ISO, and other styles
2

Lee, Hyo Sook. "Automatic text processing for Korean language free text retrieval." Thesis, University of Sheffield, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.322916.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Kay, Roderick Neil. "Text analysis, summarising and retrieval." Thesis, University of Salford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360435.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

McMurtry, William F. "Information Retrieval for Call Center Quality Assurance." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587036885211228.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Goyal, Pawan. "Analytic knowledge discovery techniques for ad-hoc information retrieval and automatic text summarization." Thesis, Ulster University, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.543897.

Full text
Abstract:
Information retrieval is broadly concerned with the problem of automated searching for information within some document repository to support various information requests by users. The traditional retrieval frameworks work on the simplistic assumptions of “word independence” and “bag-of-words”, giving rise to problems such as “term mismatch” and “context independent document indexing”. Automatic text summarization systems, which use the same paradigm as that of information retrieval, also suffer from these problems. The concept of “semantic relevance” has also not been formulated in the existing literature. This thesis presents a detailed investigation of the knowledge discovery models and proposes new approaches to address these issues. The traditional retrieval frameworks do not succeed in defining the document content fully because they do not process the concepts in the documents; only the words are processed. To address this issue, a document retrieval model has been proposed using concept hierarchies, learnt automatically from a corpus. A novel approach to give a meaningful representation to the concept nodes in a learnt hierarchy has been proposed using a fuzzy logic based soft least upper bound method. A novel approach of adapting the vector space model with dependency parse relations for information retrieval has also been developed. A user query for information retrieval (IR) applications may not contain the most appropriate terms (words) as actually intended by the user. This is usually referred to as the term mismatch problem and is a crucial research issue in IR. To address this issue, a theoretical framework for Query Representation (QR) has been developed through a comprehensive theoretical analysis of a parametric query vector. A lexical association function has been derived analytically using the relevance criteria. The proposed QR model expands the user query using this association function. A novel term association metric has been derived using the Bernoulli model of randomness. The derived metric has been used to develop a Bernoulli Query Expansion (BQE) model. The Bernoulli model of randomness has also been extended to the pseudo relevance feedback problem by proposing a Bernoulli Pseudo Relevance (BPR) model. In the traditional retrieval frameworks, the context in which a term occurs is mostly overlooked in assigning its indexing weight. This results in context independent document indexing. To address this issue, a novel Neighborhood Based Document Smoothing (NBDS) model has been proposed, which uses the lexical association between terms to provide a context sensitive indexing weight to the document terms, i.e. the term weights are redistributed based on the lexical association with the context words. To address the “context independent document indexing” problem for the sentence extraction based text summarization task, a lexical association measure derived using the Bernoulli model of randomness has been used. A new approach using the lexical association between terms has been proposed to give a context sensitive weight to the document terms, and these weights have been used for the sentence extraction task. Developed analytically, the proposed QR, BQE, BPR and NBDS models provide a proper mathematical framework for query expansion and document smoothing techniques, which have largely been heuristic in the existing literature.
Being developed in the generalized retrieval framework, which is also proposed in this thesis, these models are applicable to all of the retrieval frameworks. They have been empirically evaluated over the benchmark TREC datasets and have been shown to provide significantly better performance than the baseline retrieval frameworks, without adding significant computational or storage burden. The Bernoulli model applied to the sentence extraction task has also been shown to enhance the performance of the baseline text summarization systems over the benchmark DUC datasets. The theoretical foundations, along with the empirical results, verify that the proposed knowledge discovery models in this thesis advance the state of the art in the field of information retrieval and automatic text summarization.
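As a rough illustration of the query expansion idea discussed in this abstract, the sketch below implements plain pseudo-relevance feedback in Python. It is a generic baseline, not the analytically derived BQE or BPR models of the thesis, and all names are invented for the example:

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_docs=10, n_expansion_terms=5):
    """Naive pseudo-relevance feedback: add terms that occur frequently in the
    top-ranked documents but are not already part of the query.

    ranked_docs: list of token lists, ordered by the initial retrieval score.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(n_expansion_terms)]
    return list(query_terms) + expansion
```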
APA, Harvard, Vancouver, ISO, and other styles
6

Ermakova, Liana. "Short text contextualization in information retrieval : application to tweet contextualization and automatic query expansion." Thesis, Toulouse 2, 2016. http://www.theses.fr/2016TOU20023/document.

Full text
Abstract:
Efficient communication tends to follow the principle of least effort. According to this principle, when using a given language, interlocutors do not want to work any harder than necessary to reach understanding. This leads to the extreme compression of texts, especially in electronic communication, e.g. microblogs, SMS and search queries. However, these texts are sometimes not self-contained and need to be explained, since understanding them requires knowledge of terminology, named entities or related facts. The main goal of this research is to provide a context for a user or a system from a textual resource. The first aim of this work is to help a user better understand a short message by extracting a context from an external source, such as a text collection, the Web or Wikipedia, by means of text summarization. To this end we developed an approach for automatic multi-document summarization and applied it to short message contextualization, in particular to tweet contextualization. The proposed method is based on named entity recognition, part-of-speech weighting and sentence quality measurement. In contrast to previous research, we introduce an algorithm for smoothing from the local context. Our approach exploits the topic-comment structure of a text. Moreover, we developed a graph-based algorithm for sentence reordering. The method has been evaluated at the INEX/CLEF Tweet Contextualization track; we provide the evaluation results over the four years of the track. The method was also adapted to snippet retrieval. The evaluation results indicate good performance of the approach.
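The sentence-scoring ingredients mentioned above (query overlap, named entities, part-of-speech weighting) can be combined in a toy scorer such as the following Python sketch; the weights and function signature are assumptions, not the formula used in the thesis:

```python
def score_sentence(tokens, pos_tags, named_entities, query_terms, pos_weights=None):
    """Toy sentence score mixing query overlap, named entities and POS weights.

    tokens, pos_tags: parallel lists for one candidate sentence.
    named_entities: set of entity strings detected in the sentence.
    pos_weights: assumed values, e.g. nouns matter more than verbs or adjectives.
    """
    if pos_weights is None:
        pos_weights = {"NOUN": 1.0, "VERB": 0.6, "ADJ": 0.4}
    overlap = sum(1.0 for t in tokens if t.lower() in query_terms)
    entity_bonus = 0.5 * len(named_entities)
    pos_score = sum(pos_weights.get(tag, 0.1) for tag in pos_tags) / max(len(tokens), 1)
    return overlap + entity_bonus + pos_score
```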
APA, Harvard, Vancouver, ISO, and other styles
7

Sequeira, José Francisco Rodrigues. "Automatic knowledge base construction from unstructured text." Master's thesis, Universidade de Aveiro, 2016. http://hdl.handle.net/10773/17910.

Full text
Abstract:
Master's degree in Computer and Telematics Engineering
Taking into account the overwhelming number of biomedical publications being produced, the effort required for a user to efficiently explore those publications in order to establish relationships between a wide range of concepts is staggering. This dissertation presents GRACE, a web-based platform that provides an advanced graphical exploration interface allowing users to traverse the biomedical domain in order to find explicit and latent associations between annotated biomedical concepts belonging to a variety of semantic types (e.g., Genes, Proteins, Disorders, Procedures and Anatomy). The knowledge base used is a collection of MEDLINE articles with English abstracts. The concept annotations are stored in an efficient data store that allows complex queries and high-performance data delivery. Concept relationships are inferred through statistical analysis, applying association measures to annotated terms. These processes give the graphical interface the ability to create, in real time, a data visualization in the form of a graph for the exploration of these biomedical concept relationships.
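One common way to infer concept relationships from co-occurrence statistics, in the spirit of the association measures mentioned above, is a PMI-style score. The following Python sketch is a generic illustration and not necessarily the measure used in GRACE:

```python
import math

def association_scores(cooccurrence, concept_counts, n_docs):
    """Pointwise-mutual-information-style association between annotated concepts.

    cooccurrence: dict mapping (concept_a, concept_b) -> number of abstracts
                  in which both occur; concept_counts: concept -> abstract count.
    """
    scores = {}
    for (a, b), n_ab in cooccurrence.items():
        p_a = concept_counts[a] / n_docs
        p_b = concept_counts[b] / n_docs
        p_ab = n_ab / n_docs
        scores[(a, b)] = math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")
    return scores
```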
APA, Harvard, Vancouver, ISO, and other styles
8

Martinez-Alvarez, Miguel. "Knowledge-enhanced text classification : descriptive modelling and new approaches." Thesis, Queen Mary, University of London, 2014. http://qmro.qmul.ac.uk/xmlui/handle/123456789/27205.

Full text
Abstract:
The knowledge available to be exploited by text classification and information retrieval systems has changed significantly, both in nature and in quantity, in recent years. Nowadays, there are several sources of information that can potentially improve the classification process, and systems should be able to adapt to incorporate multiple sources of available data in different formats. This fact is especially important in environments where the required information changes rapidly and its utility may be contingent on timely implementation. For these reasons, the importance of adaptability and flexibility in information systems is rapidly growing. Current systems are usually developed for specific scenarios. As a result, significant engineering effort is needed to adapt them when new knowledge appears or the information needs change. This research investigates the usage of knowledge within text classification from two different perspectives. On one hand, the application of descriptive approaches for the seamless modelling of text classification, focusing on knowledge integration and complex data representation. The main goal is to achieve a scalable and efficient approach for rapid prototyping for text classification that can incorporate different sources and types of knowledge, and to minimise the gap between the mathematical definition and the modelling of a solution. On the other hand, the improvement of different steps of the classification process where knowledge exploitation has traditionally not been applied. In particular, this thesis introduces two classification sub-tasks, namely Semi-Automatic Text Classification (SATC) and Document Performance Prediction (DPP), and several methods to address them. SATC focuses on selecting the documents that are most likely to be wrongly assigned by the system, so that they can be manually classified, while the rest are labelled automatically. Document Performance Prediction estimates the classification quality that will be achieved for a document, given a classifier. In addition, we also propose a family of evaluation metrics to measure degrees of misclassification, and an adaptive variation of k-NN.
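A minimal realisation of the SATC idea, routing low-confidence documents to human annotators while labelling the rest automatically, could look like the Python sketch below; confidence thresholding is only one possible selection criterion and is not claimed to be the method developed in the thesis:

```python
import numpy as np

def triage(probabilities, threshold=0.8):
    """Split documents into automatically labelled vs. sent for manual review.

    probabilities: (n_docs, n_classes) array of classifier posteriors.
    Documents whose top posterior falls below the threshold go to human experts.
    """
    confidence = probabilities.max(axis=1)
    auto_idx = np.where(confidence >= threshold)[0]
    manual_idx = np.where(confidence < threshold)[0]
    auto_labels = probabilities[auto_idx].argmax(axis=1)
    return auto_idx, auto_labels, manual_idx
```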
APA, Harvard, Vancouver, ISO, and other styles
9

Dutra, Marcio Branquinho. "Busca guiada de patentes de Bioinformática." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/95/95131/tde-07022014-150130/.

Full text
Abstract:
Patents are temporary public licenses granted by the State to ensure inventors and assignees the rights to economic exploitation of their inventions. Trademark and patent offices recommend performing wide searches in different databases, using classic patent search systems and specific tools, before applying for a patent. The goal of these searches is to ensure that the invention has not yet been published, either in its original field or in other fields. Research has shown that the use of classification information improves the efficiency of patent searches. The objective of the research related to this work is to explore linguistic artifacts, Information Retrieval techniques and Automatic Classification techniques to guide searches for Bioinformatics patents. The result of this work is the Bioinformatics Patent Search System (BPS), which uses automatic classification to guide searches for Bioinformatics patents. The utility of BPS is illustrated by a comparison with current patent search tools for a specific collection of Bioinformatics patents. In the future, the BPS system should be evaluated on different and more robust collections.
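The classification-guided search idea can be illustrated with a small re-ranking sketch in Python; the classifier and search_index interfaces are assumed placeholders, not the actual BPS API:

```python
def guided_search(query, classifier, search_index, boost=2.0):
    """Re-rank a keyword search using an automatic Bioinformatics classifier.

    Assumed interfaces: classifier.predict_proba(text) returns the probability
    that the text is a Bioinformatics patent; search_index.search(query) yields
    (patent_id, text, score) tuples from a classic keyword search.
    """
    results = []
    for patent_id, text, score in search_index.search(query):
        p_bioinformatics = classifier.predict_proba(text)
        results.append((patent_id, score * (1.0 + boost * p_bioinformatics)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```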
APA, Harvard, Vancouver, ISO, and other styles
10

Artchounin, Daniel. "Tuning of machine learning algorithms for automatic bug assignment." Thesis, Linköpings universitet, Programvara och system, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139230.

Full text
Abstract:
In software development projects, bug triage consists mainly of assigning bug reports to software developers or teams (depending on the project). The partial or total automation of this task would have a positive economic impact on many software projects. This thesis introduces a systematic four-step method to find some of the best configurations of several machine learning algorithms for the automatic bug assignment problem. The four steps are used, respectively, to select a combination of pre-processing techniques, a bug report representation, a potential feature selection technique, and to tune several classifiers. The method has been applied to three software projects: 66 066 bug reports of a proprietary project, 24 450 bug reports of Eclipse JDT and 30 358 bug reports of Mozilla Firefox. 619 configurations have been applied and compared on each of these three projects. In production, using the approach introduced in this work on the bug reports of the proprietary project would have increased the accuracy by up to 16.64 percentage points.
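A configuration search over the four steps named above (pre-processing, report representation, feature selection and classifier tuning) can be expressed, for example, with a scikit-learn pipeline and grid search. The parameter values below are illustrative assumptions, not the 619 configurations compared in the thesis:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),   # pre-processing and report representation
    ("select", SelectKBest(chi2)),       # optional feature selection
    ("classifier", LinearSVC()),         # one of several candidate classifiers
])

param_grid = {
    "vectorizer__lowercase": [True, False],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "select__k": [1000, 5000, "all"],
    "classifier__C": [0.1, 1.0, 10.0],
}

# reports: list of bug report texts; teams: list of assigned teams (assumed inputs).
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
# search.fit(reports, teams); print(search.best_params_, search.best_score_)
```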
APA, Harvard, Vancouver, ISO, and other styles
11

Dang, Quoc Bao. "Information spotting in huge repositories of scanned document images." Thesis, La Rochelle, 2018. http://www.theses.fr/2018LAROS024/document.

Full text
Abstract:
This work aims at developing a generic framework able to produce camera-based applications for information spotting in huge repositories of heterogeneous-content document images via local descriptors. The targeted systems take as input a portion of an image acquired as a query, and return the focused portions of database images that best match the query. We first propose a set of generic feature descriptors for camera-based document image retrieval and spotting systems. Our proposed descriptors comprise SRIF, PSRIF, DELTRIF and SSKSRIF, which are built from the spatial organisation of the nearest keypoints around a pivot keypoint, where the keypoints are extracted from the centroids of connected components. From these keypoints, invariant geometrical features are taken into account to build the descriptors. SRIF and PSRIF are computed from a local set of the m nearest keypoints around a keypoint, while DELTRIF and SSKSRIF combine local shape description without parameters via a Delaunay triangulation formed from a set of keypoints extracted from the document image. Furthermore, we propose a framework to compute the descriptors based on the spatial organisation of dedicated keypoints (e.g. SURF, SIFT or ORB) so that they can deal with heterogeneous-content camera-based document image retrieval and spotting. In practice, a large-scale indexing system with an enormous number of descriptors places a heavy burden on memory when the descriptors are stored. In addition, the high dimensionality of the descriptors can reduce the accuracy of indexing. We therefore propose three robust indexing frameworks that can be employed without storing local descriptors in memory, saving memory and speeding up retrieval time by discarding distance validation. The randomized clustering tree indexing inherits from kd-trees, kmean-trees and random forests the way of selecting K dimensions randomly, combined with the highest-variance dimension at each node of the tree. We also propose a weighted Euclidean distance between two data points that is computed and oriented along the highest-variance dimension. The second proposed method relies on an indexing system that employs a single hash table for indexing and retrieval without storing database descriptors. Besides, we propose an extended hashing-based method for indexing multiple kinds of features coming from multiple layers of the image. Along with the proposed descriptors and indexing frameworks, we propose a simple, robust way to compute the shape orientation of MSER regions so that they can be combined with dedicated descriptors (e.g. SIFT, SURF, ORB, etc.) in a rotation-invariant manner. In the case where descriptors are able to capture neighbourhood information around MSER regions, we propose a way to extend MSER regions by increasing the radius of each region. This strategy can also be applied to other detected regions in order to make the descriptors more distinctive. Moreover, we employ the extended hashing-based method for indexing multiple kinds of features from multiple layers of images; this system is applied not only to a uniform feature type but also to multiple feature types from separate layers. Finally, in order to assess the performance of our contributions, and given that no public dataset exists for camera-based document image retrieval and spotting systems, we built a new dataset which has been made freely and publicly available to the scientific community. This dataset contains portions of document images acquired via a camera as queries. It is composed of three kinds of information: textual content, graphical content and heterogeneous content.
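As a loose illustration of descriptors built from the spatial organisation of nearby keypoints, the Python sketch below encodes the distance ratios and relative angles of the m nearest neighbours around a pivot keypoint; it is a generic example and does not reproduce the exact SRIF/PSRIF formulation:

```python
import numpy as np

def spatial_descriptor(keypoints, pivot_index, m=5):
    """Describe a pivot keypoint by the geometry of its m nearest neighbours.

    keypoints: (N, 2) array, e.g. centroids of connected components.
    Distance ratios are invariant to translation and uniform scaling; angles are
    taken relative to the closest neighbour for a crude rotation invariance.
    """
    pivot = keypoints[pivot_index]
    offsets = np.delete(keypoints, pivot_index, axis=0) - pivot
    dists = np.linalg.norm(offsets, axis=1)
    nearest = np.argsort(dists)[:m]
    d = dists[nearest]
    angles = np.arctan2(offsets[nearest, 1], offsets[nearest, 0])
    return np.concatenate([d / (d[0] + 1e-9), np.mod(angles - angles[0], 2 * np.pi)])
```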
APA, Harvard, Vancouver, ISO, and other styles
12

Wächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-64838.

Full text
Abstract:
Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high-quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child-ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.
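Term generation from statistically significant noun phrases can be approximated by contrasting domain frequencies with a background corpus, as in the Python sketch below; the scoring function is an illustrative assumption rather than the exact statistic used by DOG4DAG:

```python
import math

def significant_phrases(candidates, domain_counts, background_counts,
                        domain_total, background_total, top_n=20):
    """Rank candidate noun phrases by a log-likelihood-style ratio of their
    relative frequency in the domain corpus versus a background corpus."""
    scores = {}
    for phrase in candidates:
        p_domain = (domain_counts.get(phrase, 0) + 1) / (domain_total + 1)
        p_background = (background_counts.get(phrase, 0) + 1) / (background_total + 1)
        scores[phrase] = domain_counts.get(phrase, 0) * math.log(p_domain / p_background)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```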
APA, Harvard, Vancouver, ISO, and other styles
13

Delgado, Diogo Miguel Melo. "Automated illustration of multimedia stories." Master's thesis, Faculdade de Ciências e Tecnologia, 2010. http://hdl.handle.net/10362/4478.

Full text
Abstract:
Submitted in part fulfillment of the requirements for the degree of Master in Computer Science
We have all had the problem of forgetting what we just read a few sentences before. This stems from the problem of attention and is more common with children and the elderly, who may feel bored or distracted by something more interesting. The challenge is: how can multimedia systems assist users in reading and remembering stories? One solution is to use pictures to illustrate stories as a means to captivate one's interest, since a picture either tells a story or makes the viewer imagine one. This thesis researches the problem of automated story illustration as a method to increase readers' interest and attention. We formulate the hypothesis that an automated multimedia system can help users read a story by stimulating their reading memory with adequate visual illustrations. We propose a framework that tells a story and attempts to capture the reader's attention by providing illustrations that spark the reader's imagination. The framework automatically creates a multimedia presentation of a news story by (1) rendering the news text in a sentence-by-sentence fashion, (2) providing mechanisms to select the best illustration for each sentence and (3) selecting the set of illustrations that guarantees the best sequence. These mechanisms are rooted in image and text retrieval techniques. To further improve users' attention, users may also activate a text-to-speech functionality according to their preference or reading difficulties. First experiments show how Flickr images can illustrate BBC news articles and provide a better experience to news readers. On top of the illustration methods, a user feedback feature was implemented to refine the illustration selection; with this feature users can help the framework select more accurate results. Finally, empirical evaluations were performed in order to test the user interface, the image/sentence association algorithms and the user feedback functionality. The respective results are discussed.
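A simple version of the per-sentence illustration selection could look like the following Python sketch, where similarity is assumed to wrap some image/text retrieval backend (e.g. matching sentence terms against Flickr tags); it is an illustration, not the exact mechanism of the framework:

```python
def illustrate_story(sentences, image_ids, similarity, min_score=0.2):
    """Pick, for each sentence, the highest-scoring image not used before.

    similarity(sentence, image_id) -> float is an assumed callable wrapping an
    image/text retrieval backend. Sentences with no adequate image get None.
    """
    used, selection = set(), []
    for sentence in sentences:
        ranked = sorted(((similarity(sentence, img), img) for img in image_ids
                         if img not in used), reverse=True)
        if ranked and ranked[0][0] >= min_score:
            _, best = ranked[0]
            used.add(best)
            selection.append((sentence, best))
        else:
            selection.append((sentence, None))
    return selection
```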
APA, Harvard, Vancouver, ISO, and other styles
14

Wu, Zimin. "A partial syntactic analysis-based pre-processor for automatic indexing and retrieval of Chinese texts." Thesis, Loughborough University, 1992. https://dspace.lboro.ac.uk/2134/13685.

Full text
Abstract:
Automatic indexing is the automatic creation of a text surrogate, normally keywords or phrases, to represent the original text. In current English text retrieval systems, this process of content representation is accomplished by extracting words using spaces and punctuation as word delimiters. The same technique cannot easily be applied to Chinese texts, which contain no obvious word boundaries; they appear as a linear sequence of non-spaced or equally spaced ideographic characters, and the number of characters in words varies. The solution to the problem lies in morphological and syntactic analysis of Chinese morphemes, words and phrases. The idea is inspired by experiments on English computational morphology and its application to English text retrieval, mainly automatic compound and phrase indexing. These areas are particularly germane to Chinese because typographically there are no morph and phrase boundaries in either Chinese or English texts. The experiment is based on the hypothesis that words and phrases exceeding two Chinese characters can be characterised by a grammar that describes the concatenation behaviour of morphological and syntactic categories. This is examined using the following three procedures: (1) text segmentation - texts are divided into one- and two-character segments by searching a dictionary containing over 17,000 morphemes and words, which are tagged with morphological and syntactic categories; (2) category disambiguation - for the resulting morphemes and words tagged with more than one category, the correct one is selected based on context; (3) parsing - the segments are analysed using the grammar, which combines them into compound and complex words and phrases for indexing and retrieval. The utilities employed in the experiment include CCDOS (an extended version of MS-DOS providing a Chinese I/O system), Chinese WordStar for text input and Chinese dBASE III for dictionary construction. Source code is written in Turbo BASIC, including its database toolbox. Thirty texts were drawn randomly from newspapers to form the sample for the experiment. The results show that the partial syntactic analysis-based approach can extract keywords with a good degree of accuracy.
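Dictionary-based segmentation of unspaced Chinese text is commonly sketched with greedy maximum matching, as in the Python example below; the thesis's own procedure goes further, performing category disambiguation and partial parsing on the resulting segments:

```python
def forward_maximum_matching(text, dictionary, max_word_len=4):
    """Greedy dictionary-based segmentation of an unspaced character sequence.

    dictionary: set of known words; unknown characters fall back to
    single-character segments.
    """
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                segments.append(candidate)
                i += length
                break
    return segments

# Example: forward_maximum_matching("自動標引系統", {"自動", "標引", "系統"})
# returns ["自動", "標引", "系統"].
```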
APA, Harvard, Vancouver, ISO, and other styles
15

Ademi, Muhamet. "adXtractor – Automated and Adaptive Generation of Wrappers for Information Retrieval." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20071.

Full text
Abstract:
The aim of this project is to investigate the feasibility of retrieving unstructured automotive listings from structured web pages on the Internet. The research has two major purposes: (1) to investigate whether it is feasible to pair information extraction algorithms and compute wrappers, and (2) to demonstrate the results of pairing these techniques and evaluate the measurements. We merge two training sets available on the web to construct reference sets, which are the basis for the information extraction. The wrappers are computed by using information extraction techniques to identify data properties with a variety of methods such as fuzzy string matching, regular expressions and document tree analysis. The results demonstrate that it is possible to pair these techniques successfully and retrieve the majority of the listings. Additionally, the findings suggest that many platforms utilise lazy loading to populate image resources, which the algorithm is unable to capture. In conclusion, the study demonstrates that it is possible to use information extraction to compute wrappers dynamically by identifying data properties. Furthermore, the study demonstrates the ability to open up non-queryable domain data through a unified service.
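Fuzzy matching of reference-set values against nodes of the document tree, one of the techniques named above, might be sketched as follows in Python; the node representation and threshold are assumptions made for the example:

```python
from difflib import SequenceMatcher

def locate_fields(page_nodes, reference_values, threshold=0.8):
    """Map reference-set values (e.g. known car makes and models) to page regions.

    page_nodes: list of (node_path, text) pairs extracted from the HTML tree.
    Node paths whose text fuzzily matches a reference value can be reused as a
    wrapper for other pages of the same site.
    """
    def ratio(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    wrapper = {}
    for value in reference_values:
        best_path, best_text = max(page_nodes, key=lambda node: ratio(value, node[1]))
        if ratio(value, best_text) >= threshold:
            wrapper[value] = best_path
    return wrapper
```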
APA, Harvard, Vancouver, ISO, and other styles
16

Svensson, Pontus. "Automated Image Suggestions for News Articles : An Evaluation of Text and Image Representations in an Image Retrieval System." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166669.

Full text
Abstract:
Multimodal machine learning is a subfield of machine learning that aims to relate data from different modalities, such as texts and images. One of the many applications that could be built upon this technique is an image retrieval system that, given a text query, retrieves suitable images from a database. In this thesis, a retrieval system based on canonical correlation is used to suggest images for news articles. Different dense text representations produced by Word2vec and Doc2vec, and image representations produced by pre-trained convolutional neural networks, are explored to find out how they affect the suggestions. Which part of an article is best suited as a query to the system is also studied. In addition, experiments are carried out to determine whether an article's date of publication can be used to improve the suggestions. The results show that Word2vec outperforms Doc2vec in the task, which indicates that the meaning of article texts is not as important as the individual words they consist of. Furthermore, the queries are improved by rewarding words that are particularly significant.
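A bare-bones version of canonical-correlation-based image suggestion can be put together with scikit-learn, as in the sketch below; the component count is an arbitrary assumption, and the embeddings are presumed to come from Word2vec/Doc2vec and a pre-trained CNN as described above:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def build_index(article_vecs, image_vecs, n_components=32):
    """Fit CCA on paired (article text, image) embeddings and project the images."""
    cca = CCA(n_components=n_components)
    _, image_proj = cca.fit_transform(article_vecs, image_vecs)
    return cca, image_proj

def suggest_images(cca, image_proj, query_vec, top_k=5):
    """Rank database images for a new article text by cosine similarity in CCA space."""
    q = cca.transform(query_vec.reshape(1, -1))
    sims = (image_proj @ q.T).ravel()
    sims /= np.linalg.norm(image_proj, axis=1) * np.linalg.norm(q) + 1e-9
    return np.argsort(-sims)[:top_k]
```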
APA, Harvard, Vancouver, ISO, and other styles
17

Dziczkowski, Grzegorz. "Analyse des sentiments : système autonome d'exploration des opinions exprimées dans les critiques cinématographiques." Phd thesis, École Nationale Supérieure des Mines de Paris, 2008. http://tel.archives-ouvertes.fr/tel-00408754.

Full text
Abstract:
This thesis describes the study and development of a system designed to evaluate the sentiments expressed in film reviews. Such a system enables:
- automatic retrieval of reviews from the Internet,
- evaluation and rating of the opinions expressed in the film reviews,
- publication of the results.

In order to improve the results of predictive algorithms, the objective of this system is to provide a support system for prediction engines that analyse user profiles. First, the system searches for and retrieves probable film reviews from the Internet, in particular those written by prolific reviewers.

The system then evaluates and rates the opinion expressed in these film reviews in order to automatically associate a numerical score with each review; this is the objective of the system.
The last step is to group the reviews (together with their scores) with the user who wrote them, in order to create complete profiles and to make these profiles available to prediction engines.

For the development of this system, the research work of this thesis focused mainly on sentiment rating; this work falls within the fields of Opinion Mining and Sentiment Analysis.
Our system uses three different methods for classifying opinions. We present two new methods: one based on linguistic knowledge and one at the boundary between statistical and linguistic processing. The results obtained are then compared with the statistical method based on the Bayes classifier, widely used in the field.
It is then necessary to combine the results obtained in order to make the final evaluation as accurate as possible. For this task we used a fourth classifier based on neural networks.

Our sentiment rating, that is the rating of the reviews, is carried out on a scale from 1 to 5. This rating requires a deeper linguistic analysis than the purely binary rating (positive or negative, possibly subjective or objective) that is usually employed.

This thesis presents all the modules of the designed system at a global level, and the opinion rating part in more detail. In particular, we highlight the advantages of deep linguistic analysis, which is less used in the field of sentiment analysis than statistical analysis.
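The combination step, in which a neural classifier merges the outputs of the individual opinion classifiers into a final 1-5 rating, could be sketched with scikit-learn as follows; the two base classifiers are generic stand-ins for the linguistic and hybrid methods of the thesis, and in practice the meta-features should come from held-out predictions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def train_combined_rater(reviews, ratings):
    """reviews: list of review texts; ratings: array of integer scores 1..5."""
    vectorizer = TfidfVectorizer(max_features=20000)
    X = vectorizer.fit_transform(reviews)

    # Stand-ins for the individual opinion classifiers (incl. a Bayes baseline).
    base_classifiers = [MultinomialNB(), LogisticRegression(max_iter=1000)]
    for clf in base_classifiers:
        clf.fit(X, ratings)

    # Meta-features: per-class probabilities produced by each base classifier.
    meta_X = np.hstack([clf.predict_proba(X) for clf in base_classifiers])

    # A small neural network combines the base outputs into the final rating.
    combiner = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)
    combiner.fit(meta_X, ratings)
    return vectorizer, base_classifiers, combiner
```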
APA, Harvard, Vancouver, ISO, and other styles
18

Şentürk, Sertan. "Computational analysis of audio recordings and music scores for the description and discovery of Ottoman-Turkish Makam music." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/402102.

Full text
Abstract:
This thesis addresses several shortcomings of the current state-of-the-art methodologies in music information retrieval (MIR). In particular, it proposes several computational approaches to automatically analyze and describe music scores and audio recordings of Ottoman-Turkish makam music (OTMM). The main contributions of the thesis are the music corpus that has been created to carry out the research and the audio-score alignment methodology developed for the analysis of the corpus. In addition, several novel computational analysis methodologies are presented in the context of common MIR tasks of relevance for OTMM. Some example tasks are predominant melody extraction, tonic identification, tempo estimation, makam recognition, tuning analysis, structural analysis and melodic progression analysis. These methodologies become a part of a complete system called Dunya-makam for the exploration of large corpora of OTMM. The thesis starts by presenting the created CompMusic Ottoman-Turkish makam music corpus. The corpus includes 2200 music scores, more than 6500 audio recordings, and accompanying metadata. The data has been collected, annotated and curated with the help of music experts. Using criteria such as completeness, coverage and quality, we validate the corpus and show its research potential. In fact, our corpus is the largest and most representative resource of OTMM that can be used for computational research. Several test datasets have also been created from the corpus to develop and evaluate the specific methodologies proposed for different computational tasks addressed in the thesis. The part focusing on the analysis of music scores is centered on phrase and section level structural analysis. Phrase boundaries are automatically identified using an existing state-of-the-art segmentation methodology. Section boundaries are extracted using heuristics specific to the formatting of the music scores. Subsequently, a novel method based on graph analysis is used to establish similarities across these structural elements in terms of melody and lyrics, and to label the relations semiotically. The audio analysis section of the thesis reviews the state-of-the-art for analysing the melodic aspects of performances of OTMM. It proposes adaptations of existing predominant melody extraction methods tailored to OTMM. It also presents improvements over pitch-distribution-based tonic identification and makam recognition methodologies. The audio-score alignment methodology is the core of the thesis. It addresses the culture-specific challenges posed by the musical characteristics, music theory related representations and oral praxis of OTMM. Based on several techniques such as subsequence dynamic time warping, Hough transform and variable-length Markov models, the audio-score alignment methodology is designed to handle the structural differences between music scores and audio recordings. The method is robust to the presence of non-notated melodic expressions, tempo deviations within the music performances, and differences in tonic and tuning. The methodology utilizes the outputs of the score and audio analysis, and links the audio and the symbolic data. In addition, the alignment methodology is used to obtain a score-informed description of audio recordings.
The score-informed audio analysis not only simplifies the audio feature extraction steps that would require sophisticated audio processing approaches, but also substantially improves the performance compared with results obtained from the state-of-the-art methods solely relying on audio data. The analysis methodologies presented in the thesis are applied to the CompMusic Ottoman-Turkish makam music corpus and integrated into a web application aimed at culture-aware music discovery. Some of the methodologies have already been applied to other music traditions such as Hindustani, Carnatic and Greek music. Following open research best practices, all the created data, software tools and analysis results are openly available. The methodologies, the tools and the corpus itself provide vast opportunities for future research in many fields such as music information retrieval, computational musicology and music education.
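Subsequence dynamic time warping, one of the techniques underlying the alignment methodology, can be sketched as follows in Python; this is a textbook formulation, not the culture-specific alignment system of the thesis:

```python
import numpy as np

def subsequence_dtw(query, target):
    """Align a short query (score-derived feature frames) against any part of a
    longer target (audio feature frames).

    query, target: arrays of shape (n_frames, n_dims).
    Returns the minimal accumulated cost and the end frame of the best match.
    """
    n, m = len(query), len(target)
    cost = np.linalg.norm(query[:, None, :] - target[None, :, :], axis=2)
    acc = np.full((n, m), np.inf)
    acc[0, :] = cost[0, :]          # the match may start anywhere in the target
    for i in range(1, n):
        for j in range(m):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = cost[i, j] + best_prev
    end = int(np.argmin(acc[-1, :]))
    return acc[-1, end], end
```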
APA, Harvard, Vancouver, ISO, and other styles
19

Yusan, Wang, and 王愚善. "Automatic Text Corpora Retrieval in Example-Based Machine Translation." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/67545771395405026129.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Institute of Information Management
Academic year 89 (2000)
Translation is often a matter of finding analogous examples in linguistic databanks and of discovering how a particular source text has been translated before. Example-based approaches to machine translation (MT) are generally viewed as alternatives to knowledge-based methods and as a supplement to traditional rule-based methods, which comprise three major steps: analysis, transfer and generation. Researchers have shown many advantages of the example-based approach to translation, one of which is that it allows the user to select text corpora for a specific domain, both for efficiency in matching and for better translation quality in the target language. However, the selection of text corpora is generally carried out by someone who, if not a translator, must at least be able to read the source language well enough to judge the category of the text. As a result, a monolingual user cannot take advantage of this feature and has to endure a much longer text-corpus matching time in the bilingual (or multilingual) knowledge base. In this study, the proposed approach, namely Automatic Text Corpora Retrieval (ATCR), automates the process of identifying the corpora to which the source text is most closely related.
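One plausible way to automate the corpus selection step, in the spirit of ATCR, is to compare the source text against a textual profile of each corpus; the Python sketch below uses TF-IDF cosine similarity as an assumed stand-in for whatever matching the thesis actually employs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_corpus(source_text, corpora):
    """Pick the text corpus most closely related to the source text.

    corpora: dict mapping corpus name -> a representative profile string,
    e.g. the concatenation of its example sentences.
    """
    names = list(corpora)
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([source_text] + [corpora[n] for n in names])
    similarities = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return names[int(similarities.argmax())]
```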
APA, Harvard, Vancouver, ISO, and other styles
20

"A concept-space based multi-document text summarizer." 2001. http://library.cuhk.edu.hk/record=b5890766.

Full text
Abstract:
by Tang Ting Kap.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 88-94).
Abstracts in English and Chinese.
List of Figures --- p.vi
List of Tables --- p.vii
Chapter 1. --- INTRODUCTION --- p.1
Chapter 1.1 --- Information Overloading and Low Utilization --- p.2
Chapter 1.2 --- Problem Needs To Solve --- p.3
Chapter 1.3 --- Research Contributions --- p.4
Chapter 1.3.1 --- Using Concept Space in Summarization --- p.5
Chapter 1.3.2 --- New Extraction Method --- p.5
Chapter 1.3.3 --- Experiments on New System --- p.6
Chapter 1.4 --- Organization of This Thesis --- p.7
Chapter 2. --- LITERATURE REVIEW --- p.8
Chapter 2.1 --- Classical Approach --- p.8
Chapter 2.1.1 --- Luhn's Algorithm --- p.9
Chapter 2.1.2 --- Edmundson's Algorithm --- p.11
Chapter 2.2 --- Statistical Approach --- p.15
Chapter 2.3 --- Natural Language Processing Approach --- p.15
Chapter 3. --- PROPOSED SUMMARIZATION APPROACH --- p.18
Chapter 3.1 --- Direction of Summarization --- p.19
Chapter 3.2 --- Overview of Summarization Algorithm --- p.20
Chapter 3.2.1 --- Document Pre-processing --- p.21
Chapter 3.2.2 --- Vector Space Model --- p.23
Chapter 3.2.3 --- Sentence Extraction --- p.24
Chapter 3.3 --- Evaluation Method --- p.25
Chapter 3.3.1 --- "Recall, Precision and F-measure" --- p.25
Chapter 3.4 --- Advantage of Concept Space Approach --- p.26
Chapter 4. --- SYSTEM ARCHITECTURE --- p.27
Chapter 4.1 --- Converge Process --- p.28
Chapter 4.2 --- Diverge Process --- p.30
Chapter 4.3 --- Backward Search --- p.31
Chapter 5. --- CONVERGE PROCESS --- p.32
Chapter 5.1 --- Document Merging --- p.32
Chapter 5.2 --- Word Phrase Extraction --- p.34
Chapter 5.3 --- Automatic Indexing --- p.34
Chapter 5.4 --- Cluster Analysis --- p.35
Chapter 5.5 --- Hopfield Net Classification --- p.37
Chapter 6. --- DIVERGE PROCESS --- p.42
Chapter 6.1 --- Concept Terms Refinement --- p.42
Chapter 6.2 --- Sentence Selection --- p.43
Chapter 6.3 --- Backward Searching --- p.46
Chapter 7. --- EXPERIMENT AND RESEARCH FINDINGS --- p.48
Chapter 7.1 --- System-generated Summary v.s. Source Documents --- p.52
Chapter 7.1.1 --- Compression Ratio --- p.52
Chapter 7.1.2 --- Information Loss --- p.54
Chapter 7.2 --- System-generated Summary v.s. Human-generated Summary --- p.58
Chapter 7.2.1 --- Background of EXTRACTOR --- p.59
Chapter 7.2.2 --- Evaluation Method --- p.61
Chapter 7.3 --- Evaluation of different System-generated Summaries by Human Experts --- p.63
Chapter 8. --- CONCLUSIONS AND FUTURE RESEARCH --- p.68
Chapter 8.1 --- Conclusions --- p.68
Chapter 8.2 --- Future Work --- p.69
Chapter A. --- EXTRACTOR SYSTEM FLOW AND TEN-STEP PROCEDURE --- p.71
Chapter B. --- SUMMARY GENERATED BY MS WORD2000 --- p.75
Chapter C. --- SUMMARY GENERATED BY EXTRACTOR SOFTWARE --- p.76
Chapter D. --- SUMMARY GENERATED BY OUR SYSTEM --- p.77
Chapter E. --- SYSTEM-GENERATED WORD PHRASES FROM TEST SAMPLE --- p.78
Chapter F. --- WORD PHRASES IDENTIFIED BY SUBJECTS --- p.79
Chapter G. --- SAMPLE OF QUESTIONNAIRE --- p.84
Chapter H. --- RESULT OF QUESTIONNAIRE --- p.85
Chapter I. --- EVALUATION FOR DIVERGE PROCESS --- p.86
BIBLIOGRAPHY --- p.88
APA, Harvard, Vancouver, ISO, and other styles
21

"A probabilistic approach for automatic text filtering." 1998. http://library.cuhk.edu.hk/record=b5889506.

Full text
Abstract:
Low Kon Fan.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 165-168).
Abstract also in Chinese.
Abstract --- p.i
Acknowledgment --- p.iv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Overview of Information Filtering --- p.1
Chapter 1.2 --- Contributions --- p.4
Chapter 1.3 --- Organization of this thesis --- p.6
Chapter 2 --- Existing Approaches --- p.7
Chapter 2.1 --- Representational issues --- p.7
Chapter 2.1.1 --- Document Representation --- p.7
Chapter 2.1.2 --- Feature Selection --- p.11
Chapter 2.2 --- Traditional Approaches --- p.15
Chapter 2.2.1 --- NewsWeeder --- p.15
Chapter 2.2.2 --- NewT --- p.17
Chapter 2.2.3 --- SIFT --- p.19
Chapter 2.2.4 --- InRoute --- p.20
Chapter 2.2.5 --- Motivation of Our Approach --- p.21
Chapter 2.3 --- Probabilistic Approaches --- p.23
Chapter 2.3.1 --- The Naive Bayesian Approach --- p.25
Chapter 2.3.2 --- The Bayesian Independence Classifier Approach --- p.28
Chapter 2.4 --- Comparison --- p.31
Chapter 3 --- Our Bayesian Network Approach --- p.33
Chapter 3.1 --- Backgrounds of Bayesian Networks --- p.34
Chapter 3.2 --- Bayesian Network Induction Approach --- p.36
Chapter 3.3 --- Automatic Construction of Bayesian Networks --- p.38
Chapter 4 --- Automatic Feature Discretization --- p.50
Chapter 4.1 --- Predefined Level Discretization --- p.52
Chapter 4.2 --- Lloyd's algorithm --- p.53
Chapter 4.3 --- Class Dependence Discretization --- p.55
Chapter 5 --- Experiments and Results --- p.59
Chapter 5.1 --- Document Collections --- p.60
Chapter 5.2 --- Batch Filtering Experiments --- p.63
Chapter 5.3 --- Batch Filtering Results --- p.65
Chapter 5.4 --- Incremental Session Filtering Experiments --- p.87
Chapter 5.5 --- Incremental Session Filtering Results --- p.88
Chapter 6 --- Conclusions and Future Work --- p.105
Appendix A --- p.107
Appendix B --- p.116
Appendix C --- p.126
Appendix D --- p.131
Appendix E --- p.145
APA, Harvard, Vancouver, ISO, and other styles
22

"New learning strategies for automatic text categorization." 2001. http://library.cuhk.edu.hk/record=b5890838.

Full text
Abstract:
Lai Kwok-yin.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 125-130).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Automatic Textual Document Categorization --- p.1
Chapter 1.2 --- Meta-Learning Approach For Text Categorization --- p.3
Chapter 1.3 --- Contributions --- p.6
Chapter 1.4 --- Organization of the Thesis --- p.7
Chapter 2 --- Related Work --- p.9
Chapter 2.1 --- Existing Automatic Document Categorization Approaches --- p.9
Chapter 2.2 --- Existing Meta-Learning Approaches For Information Retrieval --- p.14
Chapter 2.3 --- Our Meta-Learning Approaches --- p.20
Chapter 3 --- Document Pre-Processing --- p.22
Chapter 3.1 --- Document Representation --- p.22
Chapter 3.2 --- Classification Scheme Learning Strategy --- p.25
Chapter 4 --- Linear Combination Approach --- p.30
Chapter 4.1 --- Overview --- p.30
Chapter 4.2 --- Linear Combination Approach - The Algorithm --- p.33
Chapter 4.2.1 --- Equal Weighting Strategy --- p.34
Chapter 4.2.2 --- Weighting Strategy Based On Utility Measure --- p.34
Chapter 4.2.3 --- Weighting Strategy Based On Document Rank --- p.35
Chapter 4.3 --- Comparisons of Linear Combination Approach and Existing Meta-Learning Methods --- p.36
Chapter 4.3.1 --- LC versus Simple Majority Voting --- p.36
Chapter 4.3.2 --- LC versus BORG --- p.38
Chapter 4.3.3 --- LC versus Restricted Linear Combination Method --- p.38
Chapter 5 --- The New Meta-Learning Model - MUDOF --- p.40
Chapter 5.1 --- Overview --- p.41
Chapter 5.2 --- Document Feature Characteristics --- p.42
Chapter 5.3 --- Classification Errors --- p.44
Chapter 5.4 --- Linear Regression Model --- p.45
Chapter 5.5 --- The MUDOF Algorithm --- p.47
Chapter 6 --- Incorporating MUDOF into Linear Combination approach --- p.52
Chapter 6.1 --- Background --- p.52
Chapter 6.2 --- Overview of MUDOF2 --- p.54
Chapter 6.3 --- Major Components of the MUDOF2 --- p.57
Chapter 6.4 --- The MUDOF2 Algorithm --- p.59
Chapter 7 --- Experimental Setup --- p.66
Chapter 7.1 --- Document Collection --- p.66
Chapter 7.2 --- Evaluation Metric --- p.68
Chapter 7.3 --- Component Classification Algorithms --- p.71
Chapter 7.4 --- Categorical Document Feature Characteristics for MUDOF and MUDOF2 --- p.72
Chapter 8 --- Experimental Results and Analysis --- p.74
Chapter 8.1 --- Performance of Linear Combination Approach --- p.74
Chapter 8.2 --- Performance of the MUDOF Approach --- p.78
Chapter 8.3 --- Performance of MUDOF2 Approach --- p.87
Chapter 9 --- Conclusions and Future Work --- p.96
Chapter 9.1 --- Conclusions --- p.96
Chapter 9.2 --- Future Work --- p.98
Chapter A --- Details of Experimental Results for Reuters-21578 corpus --- p.99
Chapter B --- Details of Experimental Results for OHSUMED corpus --- p.114
Bibliography --- p.125
APA, Harvard, Vancouver, ISO, and other styles
23

"Automatic text categorization for information filtering." 1998. http://library.cuhk.edu.hk/record=b5889734.

Full text
Abstract:
Ho Chao Yang.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 157-163).
Abstract also in Chinese.
Abstract --- p.i
Acknowledgment --- p.iii
List of Figures --- p.viii
List of Tables --- p.xiv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Automatic Document Categorization --- p.1
Chapter 1.2 --- Information Filtering --- p.3
Chapter 1.3 --- Contributions --- p.6
Chapter 1.4 --- Organization of the Thesis --- p.7
Chapter 2 --- Related Work --- p.9
Chapter 2.1 --- Existing Automatic Document Categorization Approaches --- p.9
Chapter 2.1.1 --- Rule-Based Approach --- p.10
Chapter 2.1.2 --- Similarity-Based Approach --- p.13
Chapter 2.2 --- Existing Information Filtering Approaches --- p.19
Chapter 2.2.1 --- Information Filtering Systems --- p.19
Chapter 2.2.2 --- Filtering in TREC --- p.21
Chapter 3 --- Document Pre-Processing --- p.23
Chapter 3.1 --- Document Representation --- p.23
Chapter 3.2 --- Classification Scheme Learning Strategy --- p.26
Chapter 4 --- A New Approach - IBRI --- p.31
Chapter 4.1 --- Overview of Our New IBRI Approach --- p.31
Chapter 4.2 --- The IBRI Representation and Definitions --- p.34
Chapter 4.3 --- The IBRI Learning Algorithm --- p.37
Chapter 5 --- IBRI Experiments --- p.43
Chapter 5.1 --- Experimental Setup --- p.43
Chapter 5.2 --- Evaluation Metric --- p.45
Chapter 5.3 --- Results --- p.46
Chapter 6 --- A New Approach - GIS --- p.50
Chapter 6.1 --- Motivation of GIS --- p.50
Chapter 6.2 --- Similarity-Based Learning --- p.51
Chapter 6.3 --- The Generalized Instance Set Algorithm (GIS) --- p.58
Chapter 6.4 --- Using GIS Classifiers for Classification --- p.63
Chapter 6.5 --- Time Complexity --- p.64
Chapter 7 --- GIS Experiments --- p.68
Chapter 7.1 --- Experimental Setup --- p.68
Chapter 7.2 --- Results --- p.73
Chapter 8 --- A New Information Filtering Approach Based on GIS --- p.87
Chapter 8.1 --- Information Filtering Systems --- p.87
Chapter 8.2 --- GIS-Based Information Filtering --- p.90
Chapter 9 --- Experiments on GIS-based Information Filtering --- p.95
Chapter 9.1 --- Experimental Setup --- p.95
Chapter 9.2 --- Results --- p.100
Chapter 10 --- Conclusions and Future Work --- p.108
Chapter 10.1 --- Conclusions --- p.108
Chapter 10.2 --- Future Work --- p.110
Chapter A --- Sample Documents in the corpora --- p.111
Chapter B --- Details of Experimental Results of GIS --- p.120
Chapter C --- Computational Time of Reuters-21578 Experiments --- p.141
APA, Harvard, Vancouver, ISO, and other styles
24

Kuan-Ming Chou and 周冠銘. "Using automatic keywords extraction and text clustering methods for medical information retrieval improvement." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/80362319360586009723.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Institute of Medical Informatics
101
Because the web holds a huge amount of data, searches often return many duplicate and near-duplicate results. The motivation of this thesis is to reduce the time users spend filtering this duplicate and near-duplicate information when they search. In this thesis, we propose a novel clustering method to address the near-duplicate problem. Our method transforms each document into a feature vector whose weights are the term frequencies of the corresponding words. To reduce the dimensionality of these feature vectors, we use principal component analysis (PCA) to project them into another space. After PCA, we use cosine similarity to compute the similarity between documents, and then apply an EM algorithm together with a Neyman-Pearson hypothesis test to cluster the duplicate documents. We compared our results with those of the K-means method, and the experiments show that our method outperforms K-means.
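A minimal sketch of the pipeline described here (term-frequency vectors, PCA, cosine similarity) is shown below; for brevity the EM algorithm and Neyman-Pearson test are replaced by a simple similarity threshold, and the documents are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "drug interactions in elderly patients",
    "interactions of drugs in older patients",   # near-duplicate of the first
    "image retrieval with colour histograms",
]

# Term-frequency vectors, reduced with PCA as in the abstract.
tf = CountVectorizer().fit_transform(docs).toarray()
reduced = PCA(n_components=2).fit_transform(tf)

# Pairwise cosine similarity in the reduced space; documents above the
# threshold are grouped as near-duplicates (a stand-in for the EM /
# Neyman-Pearson clustering step described in the abstract).
sims = cosine_similarity(reduced)
threshold = 0.9
pairs = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))
         if sims[i, j] > threshold]
print(pairs)
```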
APA, Harvard, Vancouver, ISO, and other styles
25

(9761117), Shayan Ali A. Akbar. "Source code search for automatic bug localization." Thesis, 2020.

Find full text
Abstract:
This dissertation advances the state-of-the-art in information retrieval (IR) based automatic bug localization for large software systems. We present techniques from three generations of IR based bug localization and compare their performances on our large and diverse bug localization dataset --- the Bugzbook dataset. The three generations span over fifteen years of research in mining software repositories for bug localization and include: (1) the generation of simple bag-of-words (BoW) based techniques, (2) the generation in which software-centric information such as bug and code change histories as well as structured information embedded in bug reports and code files are exploited to improve retrieval, and (3) the third and most recent generation in which order and semantic relationships between terms are modeled to improve the performance of bug localization systems. The dissertation also presents a novel technique called SCOR (Source Code Retrieval with Semantics and Order) which combines Markov Random Fields (MRF) based term-term ordering dependencies with semantic word vectors obtained from neural network based word embedding algorithms, such as word2vec, to better localize bugs in code files. The results presented in this dissertation show that while term-term ordering and semantic relationships significantly improve the performance when they are modeled separately in retrieval systems, the best precisions in retrieval are obtained when they are modeled together in a single retrieval system. We also show that the semantic representations of software terms learned by training the word embedding algorithm on a corpus of software repositories can be used to perform search in new software code repositories not present in the training corpus of the word embedding algorithm.
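A very loose sketch of the idea behind SCOR — mixing term-order information with semantic word vectors — follows; the `load_vectors` helper, the random vectors and the mixing weight are placeholders for illustration only, not the dissertation's actual MRF model or trained embeddings:

```python
import numpy as np

def load_vectors(vocab):
    """Placeholder: return a {word: vector} map, e.g. from a word2vec model
    trained on software repositories (here random vectors for illustration)."""
    rng = np.random.default_rng(0)
    return {w: rng.normal(size=50) for w in vocab}

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def score(query, doc, vectors, alpha=0.5):
    q, d = query.split(), doc.split()
    # Ordered bigram overlap stands in for term-term ordering dependencies.
    q_bi, d_bi = set(zip(q, q[1:])), set(zip(d, d[1:]))
    order = len(q_bi & d_bi) / max(len(q_bi), 1)
    # Semantic part: average best cosine similarity of each query term in the doc.
    sem = np.mean([max(cos(vectors[t], vectors[u]) for u in d) for t in q])
    return alpha * order + (1 - alpha) * sem

query = "null pointer exception in file parser"
docs = ["parser throws null pointer exception on empty file",
        "user interface colour theme settings"]
vectors = load_vectors(set(query.split()) | {w for d in docs for w in d.split()})
print(sorted(docs, key=lambda d: -score(query, d, vectors)))
```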
APA, Harvard, Vancouver, ISO, and other styles
26

"Automatic index generation for the free-text based database." Chinese University of Hong Kong, 1992. http://library.cuhk.edu.hk/record=b5887040.

Full text
Abstract:
by Leung Chi Hong.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1992.
Includes bibliographical references (leaves 183-184).
Chapter Chapter one: --- Introduction --- p.1
Chapter Chapter two: --- Background knowledge and linguistic approaches of automatic indexing --- p.5
Chapter 2.1 --- Definition of index and indexing --- p.5
Chapter 2.2 --- Indexing methods and problems --- p.7
Chapter 2.3 --- Automatic indexing and human indexing --- p.8
Chapter 2.4 --- Different approaches of automatic indexing --- p.10
Chapter 2.5 --- Example of semantic approach --- p.11
Chapter 2.6 --- Example of syntactic approach --- p.14
Chapter 2.7 --- Comments on semantic and syntactic approaches --- p.18
Chapter Chapter three: --- Rationale and methodology of automatic index generation --- p.19
Chapter 3.1 --- Problems caused by natural language --- p.19
Chapter 3.2 --- Usage of word frequencies --- p.20
Chapter 3.3 --- Brief description of rationale --- p.24
Chapter 3.4 --- Automatic index generation --- p.27
Chapter 3.4.1 --- Training phase --- p.27
Chapter 3.4.1.1 --- Selection of training documents --- p.28
Chapter 3.4.1.2 --- Control and standardization of variants of words --- p.28
Chapter 3.4.1.3 --- Calculation of associations between words and indexes --- p.30
Chapter 3.4.1.4 --- Discarding false associations --- p.33
Chapter 3.4.2 --- Indexing phase --- p.38
Chapter 3.4.3 --- Example of automatic indexing --- p.41
Chapter 3.5 --- Related researches --- p.44
Chapter 3.6 --- Word diversity and its effect on automatic indexing --- p.46
Chapter 3.7 --- Factors affecting performance of automatic indexing --- p.60
Chapter 3.8 --- Application of semantic representation --- p.61
Chapter 3.8.1 --- Problem of natural language --- p.61
Chapter 3.8.2 --- Use of concept headings --- p.62
Chapter 3.8.3 --- Example of using concept headings in automatic indexing --- p.65
Chapter 3.8.4 --- Advantages of concept headings --- p.68
Chapter 3.8.5 --- Disadvantages of concept headings --- p.69
Chapter 3.9 --- Correctness prediction for proposed indexes --- p.78
Chapter 3.9.1 --- Example of using index proposing rate --- p.80
Chapter 3.10 --- Effect of subject matter on automatic indexing --- p.83
Chapter 3.11 --- Comparison with other indexing methods --- p.85
Chapter 3.12 --- Proposal for applying Chinese medical knowledge --- p.90
Chapter Chapter four: --- Simulations of automatic index generation --- p.93
Chapter 4.1 --- Training phase simulations --- p.93
Chapter 4.1.1 --- Simulation of association calculation (word diversity uncontrolled) --- p.94
Chapter 4.1.2 --- Simulation of association calculation (word diversity controlled) --- p.102
Chapter 4.1.3 --- Simulation of discarding false associations --- p.107
Chapter 4.2 --- Indexing phase simulation --- p.115
Chapter 4.3 --- Simulation of using concept headings --- p.120
Chapter 4.4 --- Simulation for testing performance of predicting index correctness --- p.125
Chapter 4.5 --- Summary --- p.128
Chapter Chapter five: --- Real case study in database of Chinese Medicinal Material Research Center --- p.130
Chapter 5.1 --- Selection of real documents --- p.130
Chapter 5.2 --- Case study one: Overall performance using real data --- p.132
Chapter 5.2.1 --- Sample results of automatic indexing for real documents --- p.138
Chapter 5.3 --- Case study two: Using multi-word terms --- p.148
Chapter 5.4 --- Case study three: Using concept headings --- p.152
Chapter 5.5 --- Case study four: Prediction of proposed index correctness --- p.156
Chapter 5.6 --- Case study five: Use of (Σ ΔRij) Fi to determine false association --- p.159
Chapter 5.7 --- Case study six: Effect of word diversity --- p.162
Chapter 5.8 --- Summary --- p.166
Chapter Chapter six: --- Conclusion --- p.168
Appendix A: List of stopwords --- p.173
Appendix B: Index terms used in case studies --- p.174
References --- p.183
APA, Harvard, Vancouver, ISO, and other styles
27

Vidal-Santos, Gerard. "Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos." Thesis, 2018. http://eprints.rclis.org/33589/1/VidalSantos_TFG_2018.pdf.

Full text
Abstract:
The aim of the study is to explore the use of unsupervised Named-Entity Recognition (NER) processes to generate descriptive metadata capable of assisting information retrieval interfaces in large-scale digital collections and of supporting the construction of more diverse knowledge representation models in academic libraries. For this purpose the study reviews experiences and canonical literature on automated subject heading creation in library and archive environments as a way to counterbalance the over-reliance on search engines as the main access points to catalogs and digital collections, focusing on the guidelines established by two articles that address this task from complementary points of view: • van Hooland S, de Wilde M, Verborgh R, Steiner T, Van de Walle R. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 2013 Nov 1;30(2):262–79. • Zeng M. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky's Theory and Two Research Samples. Knowledge Organization. 2014 Jan 1;440–51. The first provides the tools to generate named entities from large samples of text and establishes the parameters to assess the suitability of these entities at a quantitative level. The second provides the guidelines to analyse the quality of those results through a three-layered framework (identification, description, interpretation) based on Erwin Panofsky's work on the analysis and interpretation of pictorial works. A working environment is built on this premise to extract and analyse the entities detected by DBpedia Spotlight (the NER service used for extraction) in a random collection of bibliographic records drawn from a thesis aggregator (Open Access Theses and Dissertations). The results show a substantial quantitative improvement in the descriptive access points provided by these processes, allowing users to browse better contextualised records more effectively when the extracted entities are combined with the keywords already indexed, even though the entities lack the consistency needed to pass the quality filter established in the evaluation table. This limitation, however, only partly constrains the possibility of improving the visibility of records in large collections by these means, provided that the logical constructions of the semantic base behind the extraction service are taken into account in iterative cataloguing processes, establishing an iterative and cost-effective way of building more diverse knowledge graphs that connect manually or automatically indexed keywords to other nodes in the linked open data (LOD) cloud.
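For readers who want to reproduce a similar extraction step, a minimal call to the DBpedia Spotlight web service might look as follows; the endpoint and response fields are those commonly documented for the public demo service, which may be rate-limited or change over time:

```python
import requests

# Public DBpedia Spotlight demo endpoint as commonly documented; availability,
# rate limits and parameters may vary, so treat this as an illustrative call only.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    """Return the surface forms and DBpedia URIs that Spotlight finds in `text`."""
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

# A bibliographic record's title or abstract would be passed in place of this string.
print(annotate("Panofsky's iconological method is widely used in art history."))
```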
APA, Harvard, Vancouver, ISO, and other styles
28

Vidal-Santos, Gerard. "Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos." Thesis, 2018. http://eprints.rclis.org/33692/1/VidalSantos_TFG_2018.pdf.

Full text
Abstract:
The aim of the study is to explore the use of unsupervised Named-Entity Recognition (NER) processes to generate descriptive metadata capable of assisting information retrieval interfaces in large-scale digital collections and of supporting the construction of more diverse knowledge representation models in academic libraries. For this purpose the study reviews experiences and canonical literature on automated subject heading creation in library and archive environments as a way to counterbalance the over-reliance on search engines as the main access points to catalogs and digital collections, focusing on the guidelines established by two articles that address this task from complementary points of view: • van Hooland S, de Wilde M, Verborgh R, Steiner T, Van de Walle R. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 2013 Nov 1;30(2):262–79. • Zeng M. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky's Theory and Two Research Samples. Knowledge Organization. 2014 Jan 1;440–51. The first provides the tools to generate named entities from large samples of text and establishes the parameters to assess the suitability of these entities at a quantitative level. The second provides the guidelines to analyse the quality of those results through a three-layered framework (identification, description, interpretation) based on Erwin Panofsky's work on the analysis and interpretation of pictorial works. A working environment is built on this premise to extract and analyse the entities detected by DBpedia Spotlight (the NER service used for extraction) in a random collection of bibliographic records drawn from a thesis aggregator (Open Access Theses and Dissertations). The results show a substantial quantitative improvement in the descriptive access points provided by these processes, allowing users to browse better contextualised records more effectively when the extracted entities are combined with the keywords already indexed, even though the entities lack the consistency needed to pass the quality filter established in the evaluation table. This limitation, however, only partly constrains the possibility of improving the visibility of records in large collections by these means, provided that the logical constructions of the semantic base behind the extraction service are taken into account in iterative cataloguing processes, establishing an iterative and cost-effective way of building more diverse knowledge graphs that connect manually or automatically indexed keywords to other nodes in the linked open data (LOD) cloud.
APA, Harvard, Vancouver, ISO, and other styles
29

Khoo, Christopher S. G. "Automatic identification of causal relations in text and their use for improving precision in information retrieval." Thesis, 1995. http://hdl.handle.net/10150/105106.

Full text
Abstract:
Parts of the thesis were published in: 1. Khoo, C., Myaeng, S.H., & Oddy, R. (2001). Using cause-effect relations in text to improve information retrieval precision. Information Processing and Management, 37(1), 119-145. 2. Khoo, C., Kornfilt, J., Oddy, R., & Myaeng, S.H. (1998). Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary & Linguistic Computing, 13(4), 177-186. 3. Khoo, C. (1997). The use of relation matching in information retrieval. LIBRES: Library and Information Science Research Electronic Journal [Online], 7(2). Available at: http://aztec.lib.utk.edu/libres/libre7n2/. An update of the literature review on causal relations in text was published in: Khoo, C., Chan, S., & Niu, Y. (2002). The many facets of the cause-effect relation. In R.Green, C.A. Bean & S.H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective (pp. 51-70). Dordrecht: Kluwer
This study represents one attempt to make use of relations expressed in text to improve information retrieval effectiveness. In particular, the study investigated whether the information obtained by matching causal relations expressed in documents with the causal relations expressed in users' queries could be used to improve document retrieval results in comparison to using just term matching without considering relations. An automatic method for identifying and extracting cause-effect information in Wall Street Journal text was developed. The method uses linguistic clues to identify causal relations without recourse to knowledge-based inferencing. The method was successful in identifying and extracting about 68% of the causal relations that were clearly expressed within a sentence or between adjacent sentences in Wall Street Journal text. Of the instances that the computer program identified as causal relations, 72% can be considered to be correct. The automatic method was used in an experimental information retrieval system to identify causal relations in a database of full-text Wall Street Journal documents. Causal relation matching was found to yield a small but significant improvement in retrieval results when the weights used for combining the scores from different types of matching were customized for each query -- as in an SDI or routing queries situation. The best results were obtained when causal relation matching was combined with word proximity matching (matching pairs of causally related words in the query with pairs of words that co-occur within document sentences). An analysis using manually identified causal relations indicates that bigger retrieval improvements can be expected with more accurate identification of causal relations. The best kind of causal relation matching was found to be one in which one member of the causal relation (either the cause or the effect) was represented as a wildcard that could match with any term. The study also investigated whether using Roget's International Thesaurus (3rd ed.) to expand query terms with synonymous and related terms would improve retrieval effectiveness. Using Roget category codes in addition to keywords did give better retrieval results. However, the Roget codes were better at identifying the non-relevant documents than the relevant ones.
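The clue-based extraction step can be illustrated with a toy sketch; the two patterns below are simplified examples and not Khoo's validated set of linguistic clues:

```python
import re

# A few simplified linguistic clues for cause-effect statements; the thesis
# uses a much richer, validated set of patterns for Wall Street Journal text.
PATTERNS = [
    re.compile(r"(?P<effect>[^.]+?)\s+because of\s+(?P<cause>[^.]+)", re.I),
    re.compile(
        r"(?P<cause>[^.]+?)\s+(?:leads to|led to|results in|resulted in|causes|caused)\s+(?P<effect>[^.]+)",
        re.I,
    ),
]

def extract_causal(sentence):
    """Return (cause, effect) pairs found by the clue patterns."""
    pairs = []
    for pat in PATTERNS:
        m = pat.search(sentence)
        if m:
            pairs.append((m.group("cause").strip(), m.group("effect").strip()))
    return pairs

print(extract_causal("Higher interest rates led to a sharp fall in housing starts."))
print(extract_causal("The plant closed because of falling demand."))
```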
APA, Harvard, Vancouver, ISO, and other styles
30

Williams, Kyle. "Learning to Read Bushman: Automatic Handwriting Recognition for Bushman Languages." Thesis, 2012. http://pubs.cs.uct.ac.za/archive/00000791/.

Full text
Abstract:
The Bleek and Lloyd Collection contains notebooks that document the tradition, language and culture of the Bushman people who lived in South Africa in the late 19th century. Transcriptions of these notebooks would allow for the provision of services such as text-based search and text-to-speech. However, these notebooks are currently only available in the form of digital scans and the manual creation of transcriptions is a costly and time-consuming process. Thus, automatic methods could serve as an alternative approach to creating transcriptions of the text in the notebooks. In order to evaluate the use of automatic methods, a corpus of Bushman texts and their associated transcriptions was created. The creation of this corpus involved: the development of a custom method for encoding the Bushman script, which contains complex diacritics; the creation of a tool for creating and transcribing the texts in the notebooks; and the running of a series of workshops in which the tool was used to create the corpus. The corpus was used to evaluate the use of various techniques for automatically transcribing the texts in the corpus in order to determine which approaches were best suited to the complex Bushman script. These techniques included the use of Support Vector Machines, Artificial Neural Networks and Hidden Markov Models as machine learning algorithms, which were coupled with different descriptive features. The effect of the texts used for training the machine learning algorithms was also investigated as well as the use of a statistical language model. It was found that, for Bushman word recognition, the use of a Support Vector Machine with Histograms of Oriented Gradient features resulted in the best performance and, for Bushman text line recognition, Marti & Bunke features resulted in the best performance when used with Hidden Markov Models. The automatic transcription of the Bushman texts proved to be difficult and the performance of the different recognition systems was largely affected by the complexities of the Bushman script. It was also found that, besides having an influence on determining which techniques may be the most appropriate for automatic handwriting recognition, the texts used in an automatic handwriting recognition system also play a large role in determining whether or not automatic recognition should be attempted at all.
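As an illustration of the best-performing word-recognition configuration reported here (HOG features with an SVM), a minimal sketch might look as follows; the synthetic images and the five word classes are placeholders for the segmented notebook images used in the thesis:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-ins for scanned word images (64x64 grayscale patches);
# in the thesis these would be segmented words from the Bleek and Lloyd notebooks.
images = rng.random((200, 64, 64))
labels = rng.integers(0, 5, size=200)   # 5 hypothetical word classes

# Histogram of Oriented Gradients descriptors as the feature representation.
features = np.array([
    hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for img in images
])

X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy on held-out synthetic data:", clf.score(X_test, y_test))
```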
APA, Harvard, Vancouver, ISO, and other styles
31

"Automatic construction of wrappers for semi-structured documents." 2001. http://library.cuhk.edu.hk/record=b5890663.

Full text
Abstract:
Lin Wai-yip.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 114-123).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Information Extraction --- p.1
Chapter 1.2 --- IE from Semi-structured Documents --- p.3
Chapter 1.3 --- Thesis Contributions --- p.7
Chapter 1.4 --- Thesis Organization --- p.9
Chapter 2 --- Related Work --- p.11
Chapter 2.1 --- Existing Approaches --- p.11
Chapter 2.2 --- Limitations of Existing Approaches --- p.18
Chapter 2.3 --- Our HISER Approach --- p.20
Chapter 3 --- System Overview --- p.23
Chapter 3.1 --- Hierarchical record Structure and Extraction Rule learning (HISER) --- p.23
Chapter 3.2 --- Hierarchical Record Structure --- p.29
Chapter 3.3 --- Extraction Rule --- p.29
Chapter 3.4 --- Wrapper Adaptation --- p.32
Chapter 4 --- Automatic Hierarchical Record Structure Construction --- p.34
Chapter 4.1 --- Motivation --- p.34
Chapter 4.2 --- Hierarchical Record Structure Representation --- p.36
Chapter 4.3 --- Constructing Hierarchical Record Structure --- p.38
Chapter 5 --- Extraction Rule Induction --- p.43
Chapter 5.1 --- Rule Representation --- p.43
Chapter 5.2 --- Extraction Rule Induction Algorithm --- p.47
Chapter 6 --- Experimental Results of Wrapper Learning --- p.54
Chapter 6.1 --- Experimental Methodology --- p.54
Chapter 6.2 --- Results on Electronic Appliance Catalogs --- p.56
Chapter 6.3 --- Results on Book Catalogs --- p.60
Chapter 6.4 --- Results on Seminar Announcements --- p.62
Chapter 7 --- Adapting Wrappers to Unseen Information Sources --- p.69
Chapter 7.1 --- Motivation --- p.69
Chapter 7.2 --- Support Vector Machines --- p.72
Chapter 7.3 --- Feature Selection --- p.76
Chapter 7.4 --- Automatic Annotation of Training Examples --- p.80
Chapter 7.4.1 --- Building SVM Models --- p.81
Chapter 7.4.2 --- Seeking Potential Training Example Candidates --- p.82
Chapter 7.4.3 --- Classifying Potential Training Examples --- p.84
Chapter 8 --- Experimental Results of Wrapper Adaptation --- p.86
Chapter 8.1 --- Experimental Methodology --- p.86
Chapter 8.2 --- Results on Electronic Appliance Catalogs --- p.89
Chapter 8.3 --- Results on Book Catalogs --- p.93
Chapter 9 --- Conclusions and Future Work --- p.97
Chapter 9.1 --- Conclusions --- p.97
Chapter 9.2 --- Future Work --- p.100
Chapter A --- Sample Experimental Pages --- p.101
Chapter B --- Detailed Experimental Results of Wrapper Adaptation of HISER --- p.109
Bibliography --- p.114
APA, Harvard, Vancouver, ISO, and other styles
32

Çapkın, Çağdaş. "Türkçe metin tabanlı açık arşivlerde kullanılan dizinleme yönteminin değerlendirilmesi / Evaluation of indexing method used in Turkish text-based open archives." Thesis, 2011. http://eprints.rclis.org/28804/1/Cagdas_CAPKIN_Yuksek_Lisans_tezi.pdf.

Full text
Abstract:
The purpose of this research is to evaluate the performance of information retrieval systems designed for open archives, and of the standards and protocols that enable retrieving and organizing information in open archives. To this end, an open archive was developed with 2215 text-based documents from the journal "Turkish Librarianship", and three different information retrieval systems based on the Boolean and Vector Space models were designed in order to evaluate retrieval performance in that archive. The systems are: a "metadata information retrieval system" (ÜBES) involving indexing based only on human-created metadata; a "full-text information retrieval system" (TBES) involving (automatic) indexing based only on the machine; and a "mixed information retrieval system" (KBES) involving indexing based on both human and machine. A descriptive research method is used to describe the current situation, and the findings are evaluated against the literature. To evaluate the performance of the information retrieval systems, "precision and recall" and "normalized recall" measurements were made. The following results were found: the precision performance of the KBES system shows a statistically significant difference in comparison to ÜBES and TBES. In each system, a strong negative correlation is identified between recall and precision, with precision decreasing as recall increases. The "normalized recall" performance of ÜBES and KBES shows a statistically significant difference in comparison to TBES, while no statistically significant difference is identified between ÜBES and KBES. ÜBES retrieves the fewest relevant and nonrelevant documents; TBES retrieves the most nonrelevant documents and the second most relevant documents; and KBES retrieves the most relevant documents and the second most nonrelevant documents. It is concluded that using the OAI-PMH and OAI-ORE protocols together, rather than OAI-PMH alone, better fits the purpose of open archives.
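The two effectiveness measures used in the study can be sketched as follows; the ranking and relevance judgements are invented, and the normalized recall formula is the common Salton-style definition, which assumes every relevant document appears somewhere in the ranking:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def normalized_recall(ranking, relevant):
    """Salton-style normalized recall: 1 - (sum of actual ranks - sum of ideal
    ranks) / (n * (N - n)), with 1-based ranks of the n relevant documents."""
    n, N = len(relevant), len(ranking)
    actual = sum(i + 1 for i, doc in enumerate(ranking) if doc in relevant)
    ideal = n * (n + 1) // 2
    return 1 - (actual - ideal) / (n * (N - n))

ranking = ["d3", "d7", "d1", "d9", "d2"]      # hypothetical system output
relevant = {"d1", "d3"}                        # hypothetical judgements
print(precision_recall(ranking, relevant))
print(normalized_recall(ranking, relevant))
```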
APA, Harvard, Vancouver, ISO, and other styles
33

"Automatic construction and adaptation of wrappers for semi-structured web documents." 2003. http://library.cuhk.edu.hk/record=b5891460.

Full text
Abstract:
Wong Tak Lam.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 88-94).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Wrapper Induction for Semi-structured Web Documents --- p.1
Chapter 1.2 --- Adapting Wrappers to Unseen Web Sites --- p.6
Chapter 1.3 --- Thesis Contributions --- p.7
Chapter 1.4 --- Thesis Organization --- p.8
Chapter 2 --- Related Work --- p.10
Chapter 2.1 --- Related Work on Wrapper Induction --- p.10
Chapter 2.2 --- Related Work on Wrapper Adaptation --- p.16
Chapter 3 --- Automatic Construction of Hierarchical Wrappers --- p.20
Chapter 3.1 --- Hierarchical Record Structure Inference --- p.22
Chapter 3.2 --- Extraction Rule Induction --- p.30
Chapter 3.3 --- Applying Hierarchical Wrappers --- p.38
Chapter 4 --- Experimental Results for Wrapper Induction --- p.40
Chapter 5 --- Adaptation of Wrappers for Unseen Web Sites --- p.52
Chapter 5.1 --- Problem Definition --- p.52
Chapter 5.2 --- Overview of Wrapper Adaptation Framework --- p.55
Chapter 5.3 --- Potential Training Example Candidate Identification --- p.58
Chapter 5.3.1 --- Useful Text Fragments --- p.58
Chapter 5.3.2 --- Training Example Generation from the Unseen Web Site --- p.60
Chapter 5.3.3 --- Modified Nearest Neighbour Classification --- p.63
Chapter 5.4 --- Machine Annotated Training Example Discovery and New Wrapper Learning --- p.64
Chapter 5.4.1 --- Text Fragment Classification --- p.64
Chapter 5.4.2 --- New Wrapper Learning --- p.69
Chapter 6 --- Case Study and Experimental Results for Wrapper Adaptation --- p.71
Chapter 6.1 --- Case Study on Wrapper Adaptation --- p.71
Chapter 6.2 --- Experimental Results --- p.73
Chapter 6.2.1 --- Book Domain --- p.74
Chapter 6.2.2 --- Consumer Electronic Appliance Domain --- p.79
Chapter 7 --- Conclusions and Future Work --- p.83
Bibliography --- p.88
Chapter A --- Detailed Performance of Wrapper Induction for Book Domain --- p.95
Chapter B --- Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain --- p.99
APA, Harvard, Vancouver, ISO, and other styles
34

Sifuentes, Raul. "Determinación de las causas de búsquedas sin resultados en un catálogo bibliográfico en línea mediante el análisis de bitácoras de transacciones : el caso de la Pontificia Universidad Católica del Perú." Thesis, 2013. http://eprints.rclis.org/23857/1/SIFUENTES_ARROYO_RAUL_CATALOGO_BIBLIOGRAFICO.pdf.

Full text
Abstract:
The present investigation aims to determine the causes of searches with zero results in the OPAC of the Sistema de Bibliotecas of the Pontificia Universidad Catolica del Peru during 2012. For this purpose, a transaction log analysis of the OPAC's searches was carried out. Three causes of zero-result searches were found: 1) term mismatch: words or phrases written correctly in search statements that do not match those used in the bibliographic description and the subject terminology assigned to each bibliographic record; 2) incorrectly written search statements due to typographical and spelling mistakes; and 3) wrong index selection, when a user selects an inappropriate index for his or her search statement. A more detailed analysis was carried out for each OPAC index. Suggestions are offered regarding reinforcing OPAC search training, improving the search engine interface, and using transaction log analysis of OPAC searches as a methodology to detect demand for library materials not present in the physical and virtual collections.
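A transaction log analysis of zero-result searches can be sketched in a few lines; the log format and the queries below are invented for illustration and do not come from the PUCP catalog:

```python
from collections import Counter

# Hypothetical transaction log lines: "<query>\t<number of results>".
log_lines = [
    "garcia marquez\t42",
    "machu pichu\t0",        # spelling mistake
    "machu picchu\t17",
    "teoria de conjuntos borrosos\t0",
    "machu pichu\t0",
]

zero_hits = Counter()
for line in log_lines:
    query, hits = line.rsplit("\t", 1)
    if int(hits) == 0:
        zero_hits[query] += 1

# The most frequent failed queries are candidates for manual inspection
# (term mismatch, misspelling, or wrong index selection).
print(zero_hits.most_common())
```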
APA, Harvard, Vancouver, ISO, and other styles
35

"Statistical modeling for lexical chains for automatic Chinese news story segmentation." 2010. http://library.cuhk.edu.hk/record=b5894500.

Full text
Abstract:
Chan, Shing Kai.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2010.
Includes bibliographical references (leaves 106-114).
Abstracts in English and Chinese.
Abstract --- p.i
Acknowledgements --- p.v
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Problem Statement --- p.2
Chapter 1.2 --- Motivation for Story Segmentation --- p.4
Chapter 1.3 --- Terminologies --- p.5
Chapter 1.4 --- Thesis Goals --- p.6
Chapter 1.5 --- Thesis Organization --- p.8
Chapter 2 --- Background Study --- p.9
Chapter 2.1 --- Coherence-based Approaches --- p.10
Chapter 2.1.1 --- Defining Coherence --- p.10
Chapter 2.1.2 --- Lexical Chaining --- p.12
Chapter 2.1.3 --- Cosine Similarity --- p.15
Chapter 2.1.4 --- Language Modeling --- p.19
Chapter 2.2 --- Feature-based Approaches --- p.21
Chapter 2.2.1 --- Lexical Cues --- p.22
Chapter 2.2.2 --- Audio Cues --- p.23
Chapter 2.2.3 --- Video Cues --- p.24
Chapter 2.3 --- Pros and Cons and Hybrid Approaches --- p.25
Chapter 2.4 --- Chapter Summary --- p.27
Chapter 3 --- Experimental Corpora --- p.29
Chapter 3.1 --- The TDT2 and TDT3 Multi-language Text Corpus --- p.29
Chapter 3.1.1 --- Introduction --- p.29
Chapter 3.1.2 --- Program Particulars and Structures --- p.31
Chapter 3.2 --- Data Preprocessing --- p.33
Chapter 3.2.1 --- Challenges of Lexical Chain Formation on Chinese Text --- p.33
Chapter 3.2.2 --- Word Segmentation for Word Units Extraction --- p.35
Chapter 3.2.3 --- Part-of-speech Tagging for Candidate Words Extraction --- p.36
Chapter 3.3 --- Chapter Summary --- p.37
Chapter 4 --- Indication of Lexical Cohesiveness by Lexical Chains --- p.39
Chapter 4.1 --- Lexical Chain as a Representation of Cohesiveness --- p.40
Chapter 4.1.1 --- Choice of Word Relations for Lexical Chaining --- p.41
Chapter 4.1.2 --- Lexical Chaining by Connecting Repeated Lexical Elements --- p.43
Chapter 4.2 --- Lexical Chain as an Indicator of Story Segments --- p.48
Chapter 4.2.1 --- Indicators of Absence of Cohesiveness --- p.49
Chapter 4.2.2 --- Indicator of Continuation of Cohesiveness --- p.58
Chapter 4.3 --- Chapter Summary --- p.62
Chapter 5 --- Indication of Story Boundaries by Lexical Chains --- p.63
Chapter 5.1 --- Formal Definition of the Classification Procedures --- p.64
Chapter 5.2 --- Theoretical Framework for Segmentation Based on Lexical Chaining --- p.65
Chapter 5.2.1 --- Evaluation of Story Segmentation Accuracy --- p.65
Chapter 5.2.2 --- Previous Approach of Story Segmentation Based on Lexical Chaining --- p.66
Chapter 5.2.3 --- Statistical Framework for Story Segmentation based on Lexical Chaining --- p.69
Chapter 5.2.4 --- Post Processing of Ratio for Boundary Identification --- p.73
Chapter 5.3 --- Comparing Segmentation Models --- p.75
Chapter 5.4 --- Chapter Summary --- p.79
Chapter 6 --- Analysis of Lexical Chains Features as Boundary Indicators --- p.80
Chapter 6.1 --- Error Analysis --- p.81
Chapter 6.2 --- Window Length in the LRT Model --- p.82
Chapter 6.3 --- The Relative Importance of Each Set of Features --- p.84
Chapter 6.4 --- The Effect of Removing Timing Information --- p.92
Chapter 6.5 --- Chapter Summary --- p.96
Chapter 7 --- Conclusions and Future Work --- p.98
Chapter 7.1 --- Contributions --- p.98
Chapter 7.2 --- Future Works --- p.100
Chapter 7.2.1 --- Further Extension of the Framework --- p.100
Chapter 7.2.2 --- Wider Applications of the Framework --- p.105
Bibliography --- p.106
APA, Harvard, Vancouver, ISO, and other styles
36

Cerdeirinha, João Manuel Macedo. "Recuperação de imagens digitais com base no conteúdo: estudo na Biblioteca de Arte e Arquivos da Fundação Calouste Gulbenkian." Master's thesis, 2019. http://hdl.handle.net/10362/91474.

Full text
Abstract:
The massive growth of multimedia data on the Internet and the emergence of new sharing platforms created major challenges for information retrieval. The limitations of text-based searches for this type of content have led to the development of a content-based information retrieval approach that has received increasing attention in recent decades. Taking into account the research carried out in this area, and digital images being the focus of this research, concepts and techniques associated with this approach are explored through a theoretical survey that reports the evolution of information retrieval and the importance that this subject has for Information Management and Curation. In the context of the systems that have been developed using automatic indexing, the various applications of this type of process are indicated. Available CBIR tools are also identified for a case study of the application of this type of image retrieval in the context of the Art Library and Archives of the Calouste Gulbenkian Foundation and the photographic collections that it holds in its resources, considering the particularities of the institution to which they belong. For the intended demonstration and according to the established criteria, online CBIR tools were initially used and, in the following phase, locally installed software was selected to search and retrieve in a specific collection. Through this case study, the strengths and weaknesses of content-based image retrieval are attested against the more traditional approach based on textual metadata currently in use in these collections. Taking into consideration the needs of users of the systems in which these digital objects are indexed, combining these techniques may lead to more satisfactory results.
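A minimal content-based comparison of the kind evaluated in this study can be sketched with global colour histograms; the synthetic images below are placeholders for digitised photographs, and real CBIR tools use far richer descriptors:

```python
import numpy as np

def colour_histogram(image, bins=8):
    """Normalised RGB histogram used as a simple global image descriptor."""
    pixels = image.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def similarity(h1, h2):
    """Histogram intersection: 1.0 means identical colour distributions."""
    return float(np.minimum(h1, h2).sum())

# Synthetic stand-ins for digitised photographs (H x W x RGB arrays);
# in practice these would be loaded from the collection's image files.
rng = np.random.default_rng(1)
collection = {f"photo_{i:03d}": rng.integers(0, 256, (32, 32, 3)) for i in range(3)}
query = rng.integers(0, 256, (32, 32, 3))

q_hist = colour_histogram(query)
ranked = sorted(collection,
                key=lambda k: -similarity(q_hist, colour_histogram(collection[k])))
print(ranked)
```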
APA, Harvard, Vancouver, ISO, and other styles
37

Moreira, Walter. "Biblioteca tradicional x biblioteca virtual: modelos de recuperação da informação." Thesis, 1998. http://eprints.rclis.org/8353/1/BibliotecaTradicionalXBibliotecaVirtual_ModelosDeRecuperacaoDaInformacao.pdf.

Full text
Abstract:
Taking libraries as "technologies of intelligence", the author develops a reflection in which the traditional library is "opposed" to its virtual counterpart. Models of information retrieval are discussed within the two models of libraries mentioned. New virtual concepts such as "navigation" or "surfing" are compared with those present in more traditional information retrieval systems, such as "best match". Other opposed concepts are also discussed: fuzzy set theory versus Boolean logic; and the interrelation between the basis of information technology and the development of information retrieval tools is also under scrutiny. Finally, the author discusses the specificity of the hypermedia environment, which poses new questions for the development of information retrieval theory. The radical differences perceived between traditional and virtual libraries do not lead the author to a polarization when considering the future. Although digitization is an irreversible tendency in the information world, the author concludes in favour of the complementarity of the two models of libraries.
APA, Harvard, Vancouver, ISO, and other styles
38

Wächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, 2010. https://tud.qucosa.de/id/qucosa%3A25496.

Full text
Abstract:
Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing an ontology has been developed that contains 17,151 terms of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available on-line under www.Go3R.org.
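The term-generation step described here can be loosely illustrated by ranking domain bigrams against a background corpus; this frequency-ratio heuristic is only a crude stand-in for DOG4DAG's statistical significance test, and the two corpora are invented:

```python
import re
from collections import Counter

def bigrams(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))

# Tiny stand-ins for a domain corpus (e.g. abstracts on a topic) and a
# general background corpus; DOG4DAG works on PubMed- and web-scale text.
domain = "embryonic stem cell differentiation of embryonic stem cells in culture"
background = "the results of the study were published in the journal last year"

dom, bg = bigrams(domain), bigrams(background)
dom_total, bg_total = sum(dom.values()), sum(bg.values())

# Rank candidate terms by how over-represented they are in the domain corpus
# (a crude stand-in for a proper statistical significance test).
def score(bigram):
    p_dom = dom[bigram] / dom_total
    p_bg = (bg[bigram] + 1) / (bg_total + len(dom))   # add-one smoothing
    return p_dom / p_bg

ranked = sorted(dom, key=score, reverse=True)
print([" ".join(b) for b in ranked[:3]])
```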
APA, Harvard, Vancouver, ISO, and other styles