Dissertations / Theses on the topic 'Automatic text retrieval'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 38 dissertations / theses for your research on the topic 'Automatic text retrieval.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Viana, Hugo Henrique Amorim. "Automatic information retrieval through text-mining." Master's thesis, Faculdade de Ciências e Tecnologia, 2013. http://hdl.handle.net/10362/11308.
Full textNowadays, around a huge amount of firms in the European Union catalogued as Small and Medium Enterprises (SMEs), employ almost a great portion of the active workforce in Europe. Nonetheless, SMEs cannot afford implementing neither methods nor tools to systematically adapt innovation as a part of their business process. Innovation is the engine to be competitive in the globalized environment, especially in the current socio-economic situation. This thesis provides a platform that when integrated with ExtremeFactories(EF) project, aids SMEs to become more competitive by means of monitoring schedule functionality. In this thesis a text-mining platform that possesses the ability to schedule a gathering information through keywords is presented. In order to develop the platform, several choices concerning the implementation have been made, in the sense that one of them requires particular emphasis is the framework, Apache Lucene Core 2 by supplying an efficient text-mining tool and it is highly used for the purpose of the thesis.
Lee, Hyo Sook. "Automatic text processing for Korean language free text retrieval." Thesis, University of Sheffield, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.322916.
Full textKay, Roderick Neil. "Text analysis, summarising and retrieval." Thesis, University of Salford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360435.
Full textMcMurtry, William F. "Information Retrieval for Call Center Quality Assurance." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587036885211228.
Full textGoyal, Pawan. "Analytic knowledge discovery techniques for ad-hoc information retrieval and automatic text summarization." Thesis, Ulster University, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.543897.
Full textErmakova, Liana. "Short text contextualization in information retrieval : application to tweet contextualization and automatic query expansion." Thesis, Toulouse 2, 2016. http://www.theses.fr/2016TOU20023/document.
Full textThe efficient communication tends to follow the principle of the least effort. According to this principle, using a given language interlocutors do not want to work any harder than necessary to reach understanding. This fact leads to the extreme compression of texts especially in electronic communication, e.g. microblogs, SMS, search queries. However, sometimes these texts are not self-contained and need to be explained since understanding them requires knowledge of terminology, named entities or related facts. The main goal of this research is to provide a context to a user or a system from a textual resource.The first aim of this work is to help a user to better understand a short message by extracting a context from an external source like a text collection, the Web or the Wikipedia by means of text summarization. To this end we developed an approach for automatic multi-document summarization and we applied it to short message contextualization, in particular to tweet contextualization. The proposed method is based on named entity recognition, part-of-speech weighting and sentence quality measuring. In contrast to previous research, we introduced an algorithm for smoothing from the local context. Our approach exploits topic-comment structure of a text. Moreover, we developed a graph-based algorithm for sentence reordering. The method has been evaluated at INEX/CLEF tweet contextualization track. We provide the evaluation results over the 4 years of the track. The method was also adapted to snippet retrieval. The evaluation results indicate good performance of the approach
Sequeira, José Francisco Rodrigues. "Automatic knowledge base construction from unstructured text." Master's thesis, Universidade de Aveiro, 2016. http://hdl.handle.net/10773/17910.
Full textTaking into account the overwhelming number of biomedical publications being produced, the effort required for a user to efficiently explore those publications in order to establish relationships between a wide range of concepts is staggering. This dissertation presents GRACE, a web-based platform that provides an advanced graphical exploration interface that allows users to traverse the biomedical domain in order to find explicit and latent associations between annotated biomedical concepts belonging to a variety of semantic types (e.g., Genes, Proteins, Disorders, Procedures and Anatomy). The knowledge base utilized is a collection of MEDLINE articles with English abstracts. These annotations are then stored in an efficient data storage that allows for complex queries and high-performance data delivery. Concept relationship are inferred through statistical analysis, applying association measures to annotated terms. These processes grant the graphical interface the ability to create, in real-time, a data visualization in the form of a graph for the exploration of these biomedical concept relationships.
Tendo em conta o crescimento do número de publicações biomédicas a serem produzidas todos os anos, o esforço exigido para que um utilizador consiga, de uma forma eficiente, explorar estas publicações para conseguir estabelecer associações entre um conjunto alargado de conceitos torna esta tarefa exaustiva. Nesta disertação apresentamos uma plataforma web chamada GRACE, que providencia uma interface gráfica de exploração que permite aos utilizadores navegar pelo domínio biomédico em busca de associações explícitas ou latentes entre conceitos biomédicos pertencentes a uma variedade de domínios semânticos (i.e., Genes, Proteínas, Doenças, Procedimentos e Anatomia). A base de conhecimento usada é uma coleção de artigos MEDLINE com resumos escritos na língua inglesa. Estas anotações são armazenadas numa base de dados que permite pesquisas complexas e obtenção de dados com alta performance. As relações entre conceitos são inferidas a partir de análise estatística, aplicando medidas de associações entre os conceitos anotados. Estes processos permitem à interface gráfica criar, em tempo real, uma visualização de dados, na forma de um grafo, para a exploração destas relações entre conceitos do domínio biomédico.
Martinez-Alvarez, Miguel. "Knowledge-enhanced text classification : descriptive modelling and new approaches." Thesis, Queen Mary, University of London, 2014. http://qmro.qmul.ac.uk/xmlui/handle/123456789/27205.
Full textDutra, Marcio Branquinho. "Busca guiada de patentes de Bioinformática." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/95/95131/tde-07022014-150130/.
Full textPatents are temporary public licenses granted by the State to ensure to inventors and assignees economical exploration rights. Trademark and patent offices recommend to perform wide searches in different databases using classic patent search systems and specific tools before a patent\'s application. The goal of these searches is to ensure the invention has not been published yet, either in its original field or in other fields. Researches have shown the use of classification information improves the efficiency on searches for patents. The objetive of the research related to this work is to explore linguistic artifacts, Information Retrieval techniques and Automatic Classification techniques, to guide searches for Bioinformatics patents. The result of this work is the Bioinformatics Patent Search System (BPS), that uses automatic classification to guide searches for Bioinformatics patents. The utility of BPS is illustrated by a comparison with other patent search tools. In the future, BPS system must be experimented with more robust collections.
Artchounin, Daniel. "Tuning of machine learning algorithms for automatic bug assignment." Thesis, Linköpings universitet, Programvara och system, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139230.
Full textDang, Quoc Bao. "Information spotting in huge repositories of scanned document images." Thesis, La Rochelle, 2018. http://www.theses.fr/2018LAROS024/document.
Full textThis work aims at developing a generic framework which is able to produce camera-based applications of information spotting in huge repositories of heterogeneous content document images via local descriptors. The targeted systems may take as input a portion of an image acquired as a query and the system is capable of returning focused portion of database image that match the query best. We firstly propose a set of generic feature descriptors for camera-based document images retrieval and spotting systems. Our proposed descriptors comprise SRIF, PSRIF, DELTRIF and SSKSRIF that are built from spatial space information of nearest keypoints around a keypoints which are extracted from centroids of connected components. From these keypoints, the invariant geometrical features are considered to be taken into account for the descriptor. SRIF and PSRIF are computed from a local set of m nearest keypoints around a keypoint. While DELTRIF and SSKSRIF can fix the way to combine local shape description without using parameter via Delaunay triangulation formed from a set of keypoints extracted from a document image. Furthermore, we propose a framework to compute the descriptors based on spatial space of dedicated keypoints e.g SURF or SIFT or ORB so that they can deal with heterogeneous-content camera-based document image retrieval and spotting. In practice, a large-scale indexing system with an enormous of descriptors put the burdens for memory when they are stored. In addition, high dimension of descriptors can make the accuracy of indexing reduce. We propose three robust indexing frameworks that can be employed without storing local descriptors in the memory for saving memory and speeding up retrieval time by discarding distance validating. The randomized clustering tree indexing inherits kd-tree, kmean-tree and random forest from the way to select K dimensions randomly combined with the highest variance dimension from each node of the tree. We also proposed the weighted Euclidean distance between two data points that is computed and oriented the highest variance dimension. The secondly proposed hashing relies on an indexing system that employs one simple hash table for indexing and retrieving without storing database descriptors. Besides, we propose an extended hashing based method for indexing multi-kinds of features coming from multi-layer of the image. Along with proposed descriptors as well indexing frameworks, we proposed a simple robust way to compute shape orientation of MSER regions so that they can combine with dedicated descriptors (e.g SIFT, SURF, ORB and etc.) rotation invariantly. In the case that descriptors are able to capture neighborhood information around MSER regions, we propose a way to extend MSER regions by increasing the radius of each region. This strategy can be also applied for other detected regions in order to make descriptors be more distinctive. Moreover, we employed the extended hashing based method for indexing multi-kinds of features from multi-layer of images. This system are not only applied for uniform feature type but also multiple feature types from multi-layers separated. Finally, in order to assess the performances of our contributions, and based on the assessment that no public dataset exists for camera-based document image retrieval and spotting systems, we built a new dataset which has been made freely and publicly available for the scientific community. This dataset contains portions of document images acquired via a camera as a query. It is composed of three kinds of information: textual content, graphical content and heterogeneous content
Wächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-64838.
Full textDelgado, Diogo Miguel Melo. "Automated illustration of multimedia stories." Master's thesis, Faculdade de Ciências e Tecnologia, 2010. http://hdl.handle.net/10362/4478.
Full textWe all had the problem of forgetting about what we just read a few sentences before. This comes from the problem of attention and is more common with children and the elderly. People feel either bored or distracted by something more interesting. The challenge is how can multimedia systems assist users in reading and remembering stories? One solution is to use pictures to illustrate stories as a mean to captivate ones interest as it either tells a story or makes the viewer imagine one. This thesis researches the problem of automated story illustration as a method to increase the readers’ interest and attention. We formulate the hypothesis that an automated multimedia system can help users in reading a story by stimulating their reading memory with adequate visual illustrations. We propose a framework that tells a story and attempts to capture the readers’ attention by providing illustrations that spark the readers’ imagination. The framework automatically creates a multimedia presentation of the news story by (1) rendering news text in a sentence by-sentence fashion, (2) providing mechanisms to select the best illustration for each sentence and (3) select the set of illustrations that guarantees the best sequence. These mechanisms are rooted in image and text retrieval techniques. To further improve users’ attention, users may also activate a text-to-speech functionality according to their preference or reading difficulties. First experiments show how Flickr images can illustrate BBC news articles and provide a better experience to news readers. On top of the illustration methods, a user feedback feature was implemented to perfect the illustrations selection. With this feature users can aid the framework in selecting more accurate results. Finally, empirical evaluations were performed in order to test the user interface,image/sentence association algorithms and users’ feedback functionalities. The respective results are discussed.
Wu, Zimin. "A partial syntactic analysis-based pre-processor for automatic indexing and retrieval of Chinese texts." Thesis, Loughborough University, 1992. https://dspace.lboro.ac.uk/2134/13685.
Full textAdemi, Muhamet. "adXtractor – Automated and Adaptive Generation of Wrappers for Information Retrieval." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20071.
Full textSvensson, Pontus. "Automated Image Suggestions for News Articles : An Evaluation of Text and Image Representations in an Image Retrieval System." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166669.
Full textDziczkowski, Grzegorz. "Analyse des sentiments : système autonome d'exploration des opinions exprimées dans les critiques cinématographiques." Phd thesis, École Nationale Supérieure des Mines de Paris, 2008. http://tel.archives-ouvertes.fr/tel-00408754.
Full text- la recherche automatique des critiques sur Internet,
- l'évaluation et la notation des opinions des critiques cinématographiques,
- la publication des résultats.
Afin d'améliorer les résultats d'application des algorithmes prédicatifs, l'objectif de ce système est de fournir un système de support pour les moteurs de prédiction analysant les profils des utilisateurs. Premièrement, le système recherche et récupère les probables critiques cinématographiques de l'Internet, en particulier celles exprimées par les commentateurs prolifiques.
Par la suite, le système procède à une évaluation et à une notation de l'opinion
exprimée dans ces critiques cinématographiques pour automatiquement associer
une note numérique à chaque critique ; tel est l'objectif du système.
La dernière étape est de regrouper les critiques (ainsi que les notes) avec l'utilisateur qui les a écrites afin de créer des profils complets, et de mettre à disposition ces profils pour les moteurs de prédictions.
Pour le développement de ce système, les travaux de recherche de cette thèse portaient essentiellement sur la notation des sentiments ; ces travaux s'insérant dans les domaines de ang : Opinion Mining et d'Analyse des Sentiments.
Notre système utilise trois méthodes différentes pour le classement des opinions. Nous présentons deux nouvelles méthodes ; une fondée sur les connaissances linguistiques et une fondée sur la limite de traitement statistique et linguistique. Les résultats obtenus sont ensuite comparés avec la méthode statistique basée sur le classificateur de Bayes, largement utilisée dans le domaine.
Il est nécessaire ensuite de combiner les résultats obtenus, afin de rendre l'évaluation finale aussi précise que possible. Pour cette tâche nous avons utilisé un quatrième classificateur basé sur les réseaux de neurones.
Notre notation des sentiments à savoir la notation des critiques est effectuée sur une échelle de 1 à 5. Cette notation demande une analyse linguistique plus profonde qu'une notation seulement binaire : positive ou négative, éventuellement subjective ou objective, habituellement utilisée.
Cette thèse présente de manière globale tous les modules du système conçu et de manière plus détaillée la partie de notation de l'opinion. En particulier, nous mettrons en évidence les avantages de l'analyse linguistique profonde moins utilisée dans le domaine de l'analyse des sentiments que l'analyse statistique.
Şentürk, Sertan. "Computational analysis of audio recordings and music scores for the description and discovery of Ottoman-Turkish Makam music." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/402102.
Full textEsta tesis aborda varias limitaciones de las metodologías más avanzadas en el campo de recuperación de información musical (MIR por sus siglas en inglés). En particular, propone varios métodos computacionales para el análisis y la descripción automáticas de partituras y grabaciones de audio de música de makam turco-otomana (MMTO). Las principales contribuciones de la tesis son el corpus de música que ha sido creado para el desarrollo de la investigación y la metodología para alineamiento de audio y partitura desarrollada para el análisis del corpus. Además, se presentan varias metodologías nuevas para análisis computacional en el contexto de las tareas comunes de MIR que son relevantes para MMTO. Algunas de estas tareas son, por ejemplo, extracción de la melodía predominante, identificación de la tónica, estimación de tempo, reconocimiento de makam, análisis de afinación, análisis estructural y análisis de progresión melódica. Estas metodologías constituyen las partes de un sistema completo para la exploración de grandes corpus de MMTO llamado Dunya-makam. La tesis comienza presentando el corpus de música de makam turcootomana de CompMusic. El corpus incluye 2200 partituras, más de 6500 grabaciones de audio, y los metadatos correspondientes. Los datos han sido recopilados, anotados y revisados con la ayuda de expertos. Utilizando criterios como compleción, cobertura y calidad, validamos el corpus y mostramos su potencial para investigación. De hecho, nuestro corpus constituye el recurso de mayor tamaño y representatividad disponible para la investigación computacional de MMTO. Varios conjuntos de datos para experimentación han sido igualmente creados a partir del corpus, con el fin de desarrollar y evaluar las metodologías específicas propuestas para las diferentes tareas computacionales abordadas en la tesis. La parte dedicada al análisis de las partituras se centra en el análisis estructural a nivel de sección y de frase. Los márgenes de frase son identificados automáticamente usando uno de los métodos de segmentación existentes más avanzados. Los márgenes de sección son extraídos usando una heurística específica al formato de las partituras. A continuación, se emplea un método de nueva creación basado en análisis gráfico para establecer similitudes a través de estos elementos estructurales en cuanto a melodía y letra, así como para etiquetar relaciones semióticamente. La sección de análisis de audio de la tesis repasa el estado de la cuestión en cuanto a análisis de los aspectos melódicos en grabaciones de MMTO. Se proponen modificaciones de métodos existentes para extracción de melodía predominante para ajustarlas a MMTO. También se presentan mejoras de metodologías tanto para identificación de tónica basadas en distribución de alturas, como para reconocimiento de makam. La metodología para alineación de audio y partitura constituye el grueso de la tesis. Aborda los retos específicos de esta cultura según vienen determinados por las características musicales, las representaciones relacionadas con la teoría musical y la praxis oral de MMTO. Basada en varias técnicas tales como deformaciones dinámicas de tiempo subsecuentes, transformada de Hough y modelos de Markov de longitud variable, la metodología de alineamiento de audio y partitura está diseñada para tratar las diferencias estructurales entre partituras y grabaciones de audio. El método es robusto a la presencia de expresiones melódicas no anotadas, desviaciones de tiempo en las grabaciones, y diferencias de tónica y afinación. La metodología utiliza los resultados del análisis de partitura y audio para enlazar el audio y los datos simbólicos. Además, la metodología de alineación se usa para obtener una descripción informada por partitura de las grabaciones de audio. El análisis de audio informado por partitura no sólo simplifica los pasos para la extracción de características de audio que de otro modo requerirían sofisticados métodos de procesado de audio, sino que también mejora sustancialmente su rendimiento en comparación con los resultados obtenidos por los métodos más avanzados basados únicamente en datos de audio. Las metodologías analíticas presentadas en la tesis son aplicadas al corpus de música de makam turco-otomana de CompMusic e integradas en una aplicación web dedicada al descubrimiento culturalmente específico de música. Algunas de las metodologías ya han sido aplicadas a otras tradiciones musicales, como música indostaní, carnática y griega. Siguiendo las mejores prácticas de investigación en abierto, todos los datos creados, las herramientas de software y los resultados de análisis está disponibles públicamente. Las metodologías, las herramientas y el corpus en sí mismo ofrecen grandes oportunidades para investigaciones futuras en muchos campos tales como recuperación de información musical, musicología computacional y educación musical.
Aquesta tesi adreça diverses deficiències en l’estat actual de les metodologies d’extracció d’informació de música (Music Information Retrieval o MIR). En particular, la tesi proposa diverses estratègies per analitzar i descriure automàticament partitures musicals i enregistraments d’actuacions musicals de música Makam Turca Otomana (OTMM en les seves sigles en anglès). Les contribucions principals de la tesi són els corpus musicals que s’han creat en el context de la tesi per tal de dur a terme la recerca i la metodologia de alineament d’àudio amb la partitura que s’ha desenvolupat per tal d’analitzar els corpus. A més la tesi presenta diverses noves metodologies d’anàlisi computacional d’OTMM per a les tasques més habituals en MIR. Alguns exemples d’aquestes tasques són la extracció de la melodia principal, la identificació del to musical, l’estimació de tempo, el reconeixement de Makam, l’anàlisi de la afinació, l’anàlisi de la estructura musical i l’anàlisi de la progressió melòdica. Aquest seguit de metodologies formen part del sistema Dunya-makam per a la exploració de grans corpus musicals d’OTMM. En primer lloc, la tesi presenta el corpus CompMusic Ottoman- Turkish makam music. Aquest inclou 2200 partitures musicals, més de 6500 enregistraments d’àudio i metadata complementària. Les dades han sigut recopilades i anotades amb ajuda d’experts en aquest repertori musical. El corpus ha estat validat en termes de d’exhaustivitat, cobertura i qualitat i mostrem aquí el seu potencial per a la recerca. De fet, aquest corpus és el la font més gran i representativa de OTMM que pot ser utilitzada per recerca computacional. També s’han desenvolupat diversos subconjunts de dades per al desenvolupament i evaluació de les metodologies específiques proposades per a les diverses tasques computacionals que es presenten en aquest tesi. La secció de la tesi que tracta de l’anàlisi de partitures musicals se centra en l’anàlisi estructural a nivell de secció i de frase musical. Els límits temporals de les frases musicals s’identifiquen automàticament gràcies a un metodologia de segmentació d’última generació. Els límits de les seccions s’extreuen utilitzant un seguit de regles heurístiques determinades pel format de les partitures musicals. Posteriorment s’utilitza un nou mètode basat en anàlisi gràfic per establir semblances entre aquest elements estructurals en termes de melodia i text. També s’utilitza aquest mètode per etiquetar les relacions semiòtiques existents. La següent secció de la tesi tracta sobre anàlisi d’àudio i en particular revisa les tecnologies d’avantguardia d’anàlisi dels aspectes melòdics en OTMM. S’hi proposen adaptacions dels mètodes d’extracció de melodia existents que s’ajusten a OTMM. També s’hi presenten millores en metodologies de reconeixement de makam i en identificació de tònica basats en distribució de to. La metodologia d’alineament d’àudio amb partitura és el nucli de la tesi. Aquesta aborda els reptes culturalment específics imposats per les característiques musicals, les representacions de la teoria musical i la pràctica oral particulars de l’OTMM. Utilitzant diverses tècniques tal i com Dynamic Time Warping, Hough Transform o models de Markov de durada variable, la metodologia d’alineament esta dissenyada per enfrontar les diferències estructurals entre partitures musicals i enregistraments d’àudio. El mètode és robust inclús en presència d’expressions musicals no anotades en la partitura, desviacions de tempo ocorregudes en les actuacions musicals i diferències de tònica i afinació. La metodologia aprofita els resultats de l’anàlisi de la partitura i l’àudio per enllaçar la informació simbòlica amb l’àudio. A més, la tècnica d’alineament s’utilitza per obtenir descripcions de l’àudio fonamentades en la partitura. L’anàlisi de l’àudio fonamentat en la partitura no només simplifica les fases d’extracció de característiques d’àudio que requeririen de mètodes de processament d’àudio sofisticats, sinó que a més millora substancialment els resultats comparat amb altres mètodes d´ultima generació que només depenen de contingut d’àudio. Les metodologies d’anàlisi presentades s’han utilitzat per analitzar el corpus CompMusic Ottoman-Turkish makam music i s’han integrat en una aplicació web destinada al descobriment musical de tradicions culturals específiques. Algunes de les metodologies ja han sigut també aplicades a altres tradicions musicals com la Hindustani, la Carnàtica i la Grega. Seguint els preceptes de la investigació oberta totes les dades creades, eines computacionals i resultats dels anàlisis estan disponibles obertament. Tant les metodologies, les eines i el corpus en si mateix proporcionen àmplies oportunitats per recerques futures en diversos camps de recerca tal i com la musicologia computacional, la extracció d’informació musical i la educació musical. Traducció d’anglès a català per Oriol Romaní Picas.
Yusan, Wang, and 王愚善. "Automatic Text Corpora Retrieval in Example-Based Machine Translation." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/67545771395405026129.
Full text國立成功大學
資訊管理研究所
89
Translation is often a matter of finding analogous examples in linguistic databanks and of discovering how a particular source language has been translated before. Example-based approaches for machine translation (MT) are generally viewed as alternatives to knowledge-based methods and the supplement of traditional rule-based methods, of which three major steps — analysis, transfer and generation — are included. Researchers have shown many advantages in the use of example-based approach for translation, one of which allows the user to select text corpora for a specific domain for the sake of efficiency in matching and the better quality of translation to the target language. However, the selection of text corpora is generally accomplished by those, if not a translator, who must be equipped with a good capability to read the source language as well as to justify the category of the text. As a result, the monolingual user is unable to take the advantage, but has to experience much longer time during the text corpus matching in the bilingual (or multilingual) knowledge base. In this study, the proposed approach, namely Automatic Text Corpora Retrieval (ATCR), is able to automate the process to identify the corpora to which the source text is mostly related.
"A concept-space based multi-document text summarizer." 2001. http://library.cuhk.edu.hk/record=b5890766.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 88-94).
Abstracts in English and Chinese.
List of Figures --- p.vi
List of Tables --- p.vii
Chapter 1. --- INTRODUCTION --- p.1
Chapter 1.1 --- Information Overloading and Low Utilization --- p.2
Chapter 1.2 --- Problem Needs To Solve --- p.3
Chapter 1.3 --- Research Contributions --- p.4
Chapter 1.3.1 --- Using Concept Space in Summarization --- p.5
Chapter 1.3.2 --- New Extraction Method --- p.5
Chapter 1.3.3 --- Experiments on New System --- p.6
Chapter 1.4 --- Organization of This Thesis --- p.7
Chapter 2. --- LITERATURE REVIEW --- p.8
Chapter 2.1 --- Classical Approach --- p.8
Chapter 2.1.1 --- Luhn's Algorithm --- p.9
Chapter 2.1.2 --- Edumundson's Algorithm --- p.11
Chapter 2.2 --- Statistical Approach --- p.15
Chapter 2.3 --- Natural Language Processing Approach --- p.15
Chapter 3. --- PROPOSED SUMMARIZATION APPROACH --- p.18
Chapter 3.1 --- Direction of Summarization --- p.19
Chapter 3.2 --- Overview of Summarization Algorithm --- p.20
Chapter 3.2.1 --- Document Pre-processing --- p.21
Chapter 3.2.2 --- Vector Space Model --- p.23
Chapter 3.2.3 --- Sentence Extraction --- p.24
Chapter 3.3 --- Evaluation Method --- p.25
Chapter 3.3.1 --- "Recall, Precision and F-measure" --- p.25
Chapter 3.4 --- Advantage of Concept Space Approach --- p.26
Chapter 4. --- SYSTEM ARCHITECTURE --- p.27
Chapter 4.1 --- Converge Process --- p.28
Chapter 4.2 --- Diverge Process --- p.30
Chapter 4.3 --- Backward Search --- p.31
Chapter 5. --- CONVERGE PROCESS --- p.32
Chapter 5.1 --- Document Merging --- p.32
Chapter 5.2 --- Word Phrase Extraction --- p.34
Chapter 5.3 --- Automatic Indexing --- p.34
Chapter 5.4 --- Cluster Analysis --- p.35
Chapter 5.5 --- Hopfield Net Classification --- p.37
Chapter 6. --- DIVERGE PROCESS --- p.42
Chapter 6.1 --- Concept Terms Refinement --- p.42
Chapter 6.2 --- Sentence Selection --- p.43
Chapter 6.3 --- Backward Searching --- p.46
Chapter 7. --- EXPERIMENT AND RESEARCH FINDINGS --- p.48
Chapter 7.1 --- System-generated Summary v.s. Source Documents --- p.52
Chapter 7.1.1 --- Compression Ratio --- p.52
Chapter 7.1.2 --- Information Loss --- p.54
Chapter 7.2 --- System-generated Summary v.s. Human-generated Summary --- p.58
Chapter 7.2.1 --- Background of EXTRACTOR --- p.59
Chapter 7.2.2 --- Evaluation Method --- p.61
Chapter 7.3 --- Evaluation of different System-generated Summaries by Human Experts --- p.63
Chapter 8. --- CONCLUSIONS AND FUTURE RESEARCH --- p.68
Chapter 8.1 --- Conclusions --- p.68
Chapter 8.2 --- Future Work --- p.69
Chapter A. --- EXTRACTOR SYSTEM FLOW AND TEN-STEP PROCEDURE --- p.71
Chapter B. --- SUMMARY GENERATED BY MS WORD2000 --- p.75
Chapter C. --- SUMMARY GENERATED BY EXTRACTOR SOFTWARE --- p.76
Chapter D. --- SUMMARY GENERATED BY OUR SYSTEM --- p.77
Chapter E. --- SYSTEM-GENERATED WORD PHRASES FROM TEST SAMPLE --- p.78
Chapter F. --- WORD PHRASES IDENTIFIED BY SUBJECTS --- p.79
Chapter G. --- SAMPLE OF QUESTIONNAIRE --- p.84
Chapter H. --- RESULT OF QUESTIONNAIRE --- p.85
Chapter I. --- EVALUATION FOR DIVERGE PROCESS --- p.86
BIBLIOGRAPHY --- p.88
"A probabilistic approach for automatic text filtering." 1998. http://library.cuhk.edu.hk/record=b5889506.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 165-168).
Abstract also in Chinese.
Abstract --- p.i
Acknowledgment --- p.iv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Overview of Information Filtering --- p.1
Chapter 1.2 --- Contributions --- p.4
Chapter 1.3 --- Organization of this thesis --- p.6
Chapter 2 --- Existing Approaches --- p.7
Chapter 2.1 --- Representational issues --- p.7
Chapter 2.1.1 --- Document Representation --- p.7
Chapter 2.1.2 --- Feature Selection --- p.11
Chapter 2.2 --- Traditional Approaches --- p.15
Chapter 2.2.1 --- NewsWeeder --- p.15
Chapter 2.2.2 --- NewT --- p.17
Chapter 2.2.3 --- SIFT --- p.19
Chapter 2.2.4 --- InRoute --- p.20
Chapter 2.2.5 --- Motivation of Our Approach --- p.21
Chapter 2.3 --- Probabilistic Approaches --- p.23
Chapter 2.3.1 --- The Naive Bayesian Approach --- p.25
Chapter 2.3.2 --- The Bayesian Independence Classifier Approach --- p.28
Chapter 2.4 --- Comparison --- p.31
Chapter 3 --- Our Bayesian Network Approach --- p.33
Chapter 3.1 --- Backgrounds of Bayesian Networks --- p.34
Chapter 3.2 --- Bayesian Network Induction Approach --- p.36
Chapter 3.3 --- Automatic Construction of Bayesian Networks --- p.38
Chapter 4 --- Automatic Feature Discretization --- p.50
Chapter 4.1 --- Predefined Level Discretization --- p.52
Chapter 4.2 --- Lloyd's algorithm . . > --- p.53
Chapter 4.3 --- Class Dependence Discretization --- p.55
Chapter 5 --- Experiments and Results --- p.59
Chapter 5.1 --- Document Collections --- p.60
Chapter 5.2 --- Batch Filtering Experiments --- p.63
Chapter 5.3 --- Batch Filtering Results --- p.65
Chapter 5.4 --- Incremental Session Filtering Experiments --- p.87
Chapter 5.5 --- Incremental Session Filtering Results --- p.88
Chapter 6 --- Conclusions and Future Work --- p.105
Appendix A --- p.107
Appendix B --- p.116
Appendix C --- p.126
Appendix D --- p.131
Appendix E --- p.145
"New learning strategies for automatic text categorization." 2001. http://library.cuhk.edu.hk/record=b5890838.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 125-130).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Automatic Textual Document Categorization --- p.1
Chapter 1.2 --- Meta-Learning Approach For Text Categorization --- p.3
Chapter 1.3 --- Contributions --- p.6
Chapter 1.4 --- Organization of the Thesis --- p.7
Chapter 2 --- Related Work --- p.9
Chapter 2.1 --- Existing Automatic Document Categorization Approaches --- p.9
Chapter 2.2 --- Existing Meta-Learning Approaches For Information Retrieval --- p.14
Chapter 2.3 --- Our Meta-Learning Approaches --- p.20
Chapter 3 --- Document Pre-Processing --- p.22
Chapter 3.1 --- Document Representation --- p.22
Chapter 3.2 --- Classification Scheme Learning Strategy --- p.25
Chapter 4 --- Linear Combination Approach --- p.30
Chapter 4.1 --- Overview --- p.30
Chapter 4.2 --- Linear Combination Approach - The Algorithm --- p.33
Chapter 4.2.1 --- Equal Weighting Strategy --- p.34
Chapter 4.2.2 --- Weighting Strategy Based On Utility Measure --- p.34
Chapter 4.2.3 --- Weighting Strategy Based On Document Rank --- p.35
Chapter 4.3 --- Comparisons of Linear Combination Approach and Existing Meta-Learning Methods --- p.36
Chapter 4.3.1 --- LC versus Simple Majority Voting --- p.36
Chapter 4.3.2 --- LC versus BORG --- p.38
Chapter 4.3.3 --- LC versus Restricted Linear Combination Method --- p.38
Chapter 5 --- The New Meta-Learning Model - MUDOF --- p.40
Chapter 5.1 --- Overview --- p.41
Chapter 5.2 --- Document Feature Characteristics --- p.42
Chapter 5.3 --- Classification Errors --- p.44
Chapter 5.4 --- Linear Regression Model --- p.45
Chapter 5.5 --- The MUDOF Algorithm --- p.47
Chapter 6 --- Incorporating MUDOF into Linear Combination approach --- p.52
Chapter 6.1 --- Background --- p.52
Chapter 6.2 --- Overview of MUDOF2 --- p.54
Chapter 6.3 --- Major Components of the MUDOF2 --- p.57
Chapter 6.4 --- The MUDOF2 Algorithm --- p.59
Chapter 7 --- Experimental Setup --- p.66
Chapter 7.1 --- Document Collection --- p.66
Chapter 7.2 --- Evaluation Metric --- p.68
Chapter 7.3 --- Component Classification Algorithms --- p.71
Chapter 7.4 --- Categorical Document Feature Characteristics for MUDOF and MUDOF2 --- p.72
Chapter 8 --- Experimental Results and Analysis --- p.74
Chapter 8.1 --- Performance of Linear Combination Approach --- p.74
Chapter 8.2 --- Performance of the MUDOF Approach --- p.78
Chapter 8.3 --- Performance of MUDOF2 Approach --- p.87
Chapter 9 --- Conclusions and Future Work --- p.96
Chapter 9.1 --- Conclusions --- p.96
Chapter 9.2 --- Future Work --- p.98
Chapter A --- Details of Experimental Results for Reuters-21578 corpus --- p.99
Chapter B --- Details of Experimental Results for OHSUMED corpus --- p.114
Bibliography --- p.125
"Automatic text categorization for information filtering." 1998. http://library.cuhk.edu.hk/record=b5889734.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 157-163).
Abstract also in Chinese.
Abstract --- p.i
Acknowledgment --- p.iii
List of Figures --- p.viii
List of Tables --- p.xiv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Automatic Document Categorization --- p.1
Chapter 1.2 --- Information Filtering --- p.3
Chapter 1.3 --- Contributions --- p.6
Chapter 1.4 --- Organization of the Thesis --- p.7
Chapter 2 --- Related Work --- p.9
Chapter 2.1 --- Existing Automatic Document Categorization Approaches --- p.9
Chapter 2.1.1 --- Rule-Based Approach --- p.10
Chapter 2.1.2 --- Similarity-Based Approach --- p.13
Chapter 2.2 --- Existing Information Filtering Approaches --- p.19
Chapter 2.2.1 --- Information Filtering Systems --- p.19
Chapter 2.2.2 --- Filtering in TREC --- p.21
Chapter 3 --- Document Pre-Processing --- p.23
Chapter 3.1 --- Document Representation --- p.23
Chapter 3.2 --- Classification Scheme Learning Strategy --- p.26
Chapter 4 --- A New Approach - IBRI --- p.31
Chapter 4.1 --- Overview of Our New IBRI Approach --- p.31
Chapter 4.2 --- The IBRI Representation and Definitions --- p.34
Chapter 4.3 --- The IBRI Learning Algorithm --- p.37
Chapter 5 --- IBRI Experiments --- p.43
Chapter 5.1 --- Experimental Setup --- p.43
Chapter 5.2 --- Evaluation Metric --- p.45
Chapter 5.3 --- Results --- p.46
Chapter 6 --- A New Approach - GIS --- p.50
Chapter 6.1 --- Motivation of GIS --- p.50
Chapter 6.2 --- Similarity-Based Learning --- p.51
Chapter 6.3 --- The Generalized Instance Set Algorithm (GIS) --- p.58
Chapter 6.4 --- Using GIS Classifiers for Classification --- p.63
Chapter 6.5 --- Time Complexity --- p.64
Chapter 7 --- GIS Experiments --- p.68
Chapter 7.1 --- Experimental Setup --- p.68
Chapter 7.2 --- Results --- p.73
Chapter 8 --- A New Information Filtering Approach Based on GIS --- p.87
Chapter 8.1 --- Information Filtering Systems --- p.87
Chapter 8.2 --- GIS-Based Information Filtering --- p.90
Chapter 9 --- Experiments on GIS-based Information Filtering --- p.95
Chapter 9.1 --- Experimental Setup --- p.95
Chapter 9.2 --- Results --- p.100
Chapter 10 --- Conclusions and Future Work --- p.108
Chapter 10.1 --- Conclusions --- p.108
Chapter 10.2 --- Future Work --- p.110
Chapter A --- Sample Documents in the corpora --- p.111
Chapter B --- Details of Experimental Results of GIS --- p.120
Chapter C --- Computational Time of Reuters-21578 Experiments --- p.141
Kuan-MingChou and 周冠銘. "Using automatic keywords extraction and text clustering methods for medical information retrieval improvement." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/80362319360586009723.
Full text國立成功大學
醫學資訊研究所
101
Because there are huge data on the web, it will get many duplicate and near-duplicate search results when we search on the web. The motivation of this thesis is that reduce the time of filtering the huge duplicate and near-duplicate information when user search. In this thesis, we propose a novel clustering method to solve near-duplicate problem. Our method transforms each document to a feature vector, where the weights are terms frequency of each corresponding words. For reducing the dimension of these feature vectors, we used principle component analysis to transform these vectors to another space. After PCA, we used cosine similarity to compute the similarity of each document. And then, we used EM algorithm and Neyman-Pearson hypothesis test to cluster the duplicate documents. We compared out results with K-means method results. The experiments show that our method is outperformer than K-means method.
(9761117), Shayan Ali A. Akbar. "Source code search for automatic bug localization." Thesis, 2020.
Find full text"Automatic index generation for the free-text based database." Chinese University of Hong Kong, 1992. http://library.cuhk.edu.hk/record=b5887040.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 1992.
Includes bibliographical references (leaves 183-184).
Chapter Chapter one: --- Introduction --- p.1
Chapter Chapter two: --- Background knowledge and linguistic approaches of automatic indexing --- p.5
Chapter 2.1 --- Definition of index and indexing --- p.5
Chapter 2.2 --- Indexing methods and problems --- p.7
Chapter 2.3 --- Automatic indexing and human indexing --- p.8
Chapter 2.4 --- Different approaches of automatic indexing --- p.10
Chapter 2.5 --- Example of semantic approach --- p.11
Chapter 2.6 --- Example of syntactic approach --- p.14
Chapter 2.7 --- Comments on semantic and syntactic approaches --- p.18
Chapter Chapter three: --- Rationale and methodology of automatic index generation --- p.19
Chapter 3.1 --- Problems caused by natural language --- p.19
Chapter 3.2 --- Usage of word frequencies --- p.20
Chapter 3.3 --- Brief description of rationale --- p.24
Chapter 3.4 --- Automatic index generation --- p.27
Chapter 3.4.1 --- Training phase --- p.27
Chapter 3.4.1.1 --- Selection of training documents --- p.28
Chapter 3.4.1.2 --- Control and standardization of variants of words --- p.28
Chapter 3.4.1.3 --- Calculation of associations between words and indexes --- p.30
Chapter 3.4.1.4 --- Discarding false associations --- p.33
Chapter 3.4.2 --- Indexing phase --- p.38
Chapter 3.4.3 --- Example of automatic indexing --- p.41
Chapter 3.5 --- Related researches --- p.44
Chapter 3.6 --- Word diversity and its effect on automatic indexing --- p.46
Chapter 3.7 --- Factors affecting performance of automatic indexing --- p.60
Chapter 3.8 --- Application of semantic representation --- p.61
Chapter 3.8.1 --- Problem of natural language --- p.61
Chapter 3.8.2 --- Use of concept headings --- p.62
Chapter 3.8.3 --- Example of using concept headings in automatic indexing --- p.65
Chapter 3.8.4 --- Advantages of concept headings --- p.68
Chapter 3.8.5 --- Disadvantages of concept headings --- p.69
Chapter 3.9 --- Correctness prediction for proposed indexes --- p.78
Chapter 3.9.1 --- Example of using index proposing rate --- p.80
Chapter 3.10 --- Effect of subject matter on automatic indexing --- p.83
Chapter 3.11 --- Comparison with other indexing methods --- p.85
Chapter 3.12 --- Proposal for applying Chinese medical knowledge --- p.90
Chapter Chapter four: --- Simulations of automatic index generation --- p.93
Chapter 4.1 --- Training phase simulations --- p.93
Chapter 4.1.1 --- Simulation of association calculation (word diversity uncontrolled) --- p.94
Chapter 4.1.2 --- Simulation of association calculation (word diversity controlled) --- p.102
Chapter 4.1.3 --- Simulation of discarding false associations --- p.107
Chapter 4.2 --- Indexing phase simulation --- p.115
Chapter 4.3 --- Simulation of using concept headings --- p.120
Chapter 4.4 --- Simulation for testing performance of predicting index correctness --- p.125
Chapter 4.5 --- Summary --- p.128
Chapter Chapter five: --- Real case study in database of Chinese Medicinal Material Research Center --- p.130
Chapter 5.1 --- Selection of real documents --- p.130
Chapter 5.2 --- Case study one: Overall performance using real data --- p.132
Chapter 5.2.1 --- Sample results of automatic indexing for real documents --- p.138
Chapter 5.3 --- Case study two: Using multi-word terms --- p.148
Chapter 5.4 --- Case study three: Using concept headings --- p.152
Chapter 5.5 --- Case study four: Prediction of proposed index correctness --- p.156
Chapter 5.6 --- Case study five: Use of (Σ ΔRij) Fi to determine false association --- p.159
Chapter 5.7 --- Case study six: Effect of word diversity --- p.162
Chapter 5.8 --- Summary --- p.166
Chapter Chapter six: --- Conclusion --- p.168
Appendix A: List of stopwords --- p.173
Appendix B: Index terms used in case studies --- p.174
References --- p.183
Vidal-Santos, Gerard. "Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos." Thesis, 2018. http://eprints.rclis.org/33589/1/VidalSantos_TFG_2018.pdf.
Full textVidal-Santos, Gerard. "Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos." Thesis, 2018. http://eprints.rclis.org/33692/1/VidalSantos_TFG_2018.pdf.
Full textKhoo, Christopher S. G. "Automatic identification of causal relations in text and their use for improving precision in information retrieval." Thesis, 1995. http://hdl.handle.net/10150/105106.
Full textThis study represents one attempt to make use of relations expressed in text to improve information retrieval effectiveness. In particular, the study investigated whether the information obtained by matching causal relations expressed in documents with the causal relations expressed in users' queries could be used to improve document retrieval results in comparison to using just term matching without considering relations. An automatic method for identifying and extracting cause-effect information in Wall Street Journal text was developed. The method uses linguistic clues to identify causal relations without recourse to knowledge-based inferencing. The method was successful in identifying and extracting about 68% of the causal relations that were clearly expressed within a sentence or between adjacent sentences in Wall Street Journal text. Of the instances that the computer program identified as causal relations, 72% can be considered to be correct. The automatic method was used in an experimental information retrieval system to identify causal relations in a database of full-text Wall Street Journal documents. Causal relation matching was found to yield a small but significant improvement in retrieval results when the weights used for combining the scores from different types of matching were customized for each query -- as in an SDI or routing queries situation. The best results were obtained when causal relation matching was combined with word proximity matching (matching pairs of causally related words in the query with pairs of words that co-occur within document sentences). An analysis using manually identified causal relations indicate that bigger retrieval improvements can be expected with more accurate identification of causal relations. The best kind of causal relation matching was found to be one in which one member of the causal relation (either the cause or the effect) was represented as a wildcard that could match with any term. The study also investigated whether using Roget's International Thesaurus (3rd ed.) to expand query terms with synonymous and related terms would improve retrieval effectiveness. Using Roget category codes in addition to keywords did give better retrieval results. However, the Roget codes were better at identifying the non-relevant documents than the relevant ones. Parts of the thesis were published in: 1. Khoo, C., Myaeng, S.H., & Oddy, R. (2001). Using cause-effect relations in text to improve information retrieval precision. Information Processing and Management, 37(1), 119-145. 2. Khoo, C., Kornfilt, J., Oddy, R., & Myaeng, S.H. (1998). Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary & Linguistic Computing, 13(4), 177-186. 3. Khoo, C. (1997). The use of relation matching in information retrieval. LIBRES: Library and Information Science Research Electronic Journal [Online], 7(2). Available at: http://aztec.lib.utk.edu/libres/libre7n2/. An update of the literature review on causal relations in text was published in: Khoo, C., Chan, S., & Niu, Y. (2002). The many facets of the cause-effect relation. In R.Green, C.A. Bean & S.H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective (pp. 51-70). Dordrecht: Kluwer
Williams, Kyle. "Learning to Read Bushman: Automatic Handwriting Recognition for Bushman Languages." Thesis, 2012. http://pubs.cs.uct.ac.za/archive/00000791/.
Full text"Automatic construction of wrappers for semi-structured documents." 2001. http://library.cuhk.edu.hk/record=b5890663.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 114-123).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Information Extraction --- p.1
Chapter 1.2 --- IE from Semi-structured Documents --- p.3
Chapter 1.3 --- Thesis Contributions --- p.7
Chapter 1.4 --- Thesis Organization --- p.9
Chapter 2 --- Related Work --- p.11
Chapter 2.1 --- Existing Approaches --- p.11
Chapter 2.2 --- Limitations of Existing Approaches --- p.18
Chapter 2.3 --- Our HISER Approach --- p.20
Chapter 3 --- System Overview --- p.23
Chapter 3.1 --- Hierarchical record Structure and Extraction Rule learning (HISER) --- p.23
Chapter 3.2 --- Hierarchical Record Structure --- p.29
Chapter 3.3 --- Extraction Rule --- p.29
Chapter 3.4 --- Wrapper Adaptation --- p.32
Chapter 4 --- Automatic Hierarchical Record Structure Construction --- p.34
Chapter 4.1 --- Motivation --- p.34
Chapter 4.2 --- Hierarchical Record Structure Representation --- p.36
Chapter 4.3 --- Constructing Hierarchical Record Structure --- p.38
Chapter 5 --- Extraction Rule Induction --- p.43
Chapter 5.1 --- Rule Representation --- p.43
Chapter 5.2 --- Extraction Rule Induction Algorithm --- p.47
Chapter 6 --- Experimental Results of Wrapper Learning --- p.54
Chapter 6.1 --- Experimental Methodology --- p.54
Chapter 6.2 --- Results on Electronic Appliance Catalogs --- p.56
Chapter 6.3 --- Results on Book Catalogs --- p.60
Chapter 6.4 --- Results on Seminar Announcements --- p.62
Chapter 7 --- Adapting Wrappers to Unseen Information Sources --- p.69
Chapter 7.1 --- Motivation --- p.69
Chapter 7.2 --- Support Vector Machines --- p.72
Chapter 7.3 --- Feature Selection --- p.76
Chapter 7.4 --- Automatic Annotation of Training Examples --- p.80
Chapter 7.4.1 --- Building SVM Models --- p.81
Chapter 7.4.2 --- Seeking Potential Training Example Candidates --- p.82
Chapter 7.4.3 --- Classifying Potential Training Examples --- p.84
Chapter 8 --- Experimental Results of Wrapper Adaptation --- p.86
Chapter 8.1 --- Experimental Methodology --- p.86
Chapter 8.2 --- Results on Electronic Appliance Catalogs --- p.89
Chapter 8.3 --- Results on Book Catalogs --- p.93
Chapter 9 --- Conclusions and Future Work --- p.97
Chapter 9.1 --- Conclusions --- p.97
Chapter 9.2 --- Future Work --- p.100
Chapter A --- Sample Experimental Pages --- p.101
Chapter B --- Detailed Experimental Results of Wrapper Adaptation of HISER --- p.109
Bibliography --- p.114
Çapkın, Çağdaş. "Türkçe metin tabanlı açık arşivlerde kullanılan dizinleme yönteminin değerlendirilmesi / Evaluation of indexing method used in Turkish text-based open archives." Thesis, 2011. http://eprints.rclis.org/28804/1/Cagdas_CAPKIN_Yuksek_Lisans_tezi.pdf.
Full text"Automatic construction and adaptation of wrappers for semi-structured web documents." 2003. http://library.cuhk.edu.hk/record=b5891460.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 88-94).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Wrapper Induction for Semi-structured Web Documents --- p.1
Chapter 1.2 --- Adapting Wrappers to Unseen Web Sites --- p.6
Chapter 1.3 --- Thesis Contributions --- p.7
Chapter 1.4 --- Thesis Organization --- p.8
Chapter 2 --- Related Work --- p.10
Chapter 2.1 --- Related Work on Wrapper Induction --- p.10
Chapter 2.2 --- Related Work on Wrapper Adaptation --- p.16
Chapter 3 --- Automatic Construction of Hierarchical Wrappers --- p.20
Chapter 3.1 --- Hierarchical Record Structure Inference --- p.22
Chapter 3.2 --- Extraction Rule Induction --- p.30
Chapter 3.3 --- Applying Hierarchical Wrappers --- p.38
Chapter 4 --- Experimental Results for Wrapper Induction --- p.40
Chapter 5 --- Adaptation of Wrappers for Unseen Web Sites --- p.52
Chapter 5.1 --- Problem Definition --- p.52
Chapter 5.2 --- Overview of Wrapper Adaptation Framework --- p.55
Chapter 5.3 --- Potential Training Example Candidate Identification --- p.58
Chapter 5.3.1 --- Useful Text Fragments --- p.58
Chapter 5.3.2 --- Training Example Generation from the Unseen Web Site --- p.60
Chapter 5.3.3 --- Modified Nearest Neighbour Classification --- p.63
Chapter 5.4 --- Machine Annotated Training Example Discovery and New Wrap- per Learning --- p.64
Chapter 5.4.1 --- Text Fragment Classification --- p.64
Chapter 5.4.2 --- New Wrapper Learning --- p.69
Chapter 6 --- Case Study and Experimental Results for Wrapper Adapta- tion --- p.71
Chapter 6.1 --- Case Study on Wrapper Adaptation --- p.71
Chapter 6.2 --- Experimental Results --- p.73
Chapter 6.2.1 --- Book Domain --- p.74
Chapter 6.2.2 --- Consumer Electronic Appliance Domain --- p.79
Chapter 7 --- Conclusions and Future Work --- p.83
Bibliography --- p.88
Chapter A --- Detailed Performance of Wrapper Induction for Book Do- main --- p.95
Chapter B --- Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain --- p.99
Sifuentes, Raul. "Determinación de las causas de búsquedas sin resultados en un catálogo bibliográfico en línea mediante el análisis de bitácoras de transacciones : el caso de la Pontificia Universidad Católica del Perú." Thesis, 2013. http://eprints.rclis.org/23857/1/SIFUENTES_ARROYO_RAUL_CATALOGO_BIBLIOGRAFICO.pdf.
Full text"Statistical modeling for lexical chains for automatic Chinese news story segmentation." 2010. http://library.cuhk.edu.hk/record=b5894500.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2010.
Includes bibliographical references (leaves 106-114).
Abstracts in English and Chinese.
Abstract --- p.i
Acknowledgements --- p.v
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Problem Statement --- p.2
Chapter 1.2 --- Motivation for Story Segmentation --- p.4
Chapter 1.3 --- Terminologies --- p.5
Chapter 1.4 --- Thesis Goals --- p.6
Chapter 1.5 --- Thesis Organization --- p.8
Chapter 2 --- Background Study --- p.9
Chapter 2.1 --- Coherence-based Approaches --- p.10
Chapter 2.1.1 --- Defining Coherence --- p.10
Chapter 2.1.2 --- Lexical Chaining --- p.12
Chapter 2.1.3 --- Cosine Similarity --- p.15
Chapter 2.1.4 --- Language Modeling --- p.19
Chapter 2.2 --- Feature-based Approaches --- p.21
Chapter 2.2.1 --- Lexical Cues --- p.22
Chapter 2.2.2 --- Audio Cues --- p.23
Chapter 2.2.3 --- Video Cues --- p.24
Chapter 2.3 --- Pros and Cons and Hybrid Approaches --- p.25
Chapter 2.4 --- Chapter Summary --- p.27
Chapter 3 --- Experimental Corpora --- p.29
Chapter 3.1 --- The TDT2 and TDT3 Multi-language Text Corpus --- p.29
Chapter 3.1.1 --- Introduction --- p.29
Chapter 3.1.2 --- Program Particulars and Structures --- p.31
Chapter 3.2 --- Data Preprocessing --- p.33
Chapter 3.2.1 --- Challenges of Lexical Chain Formation on Chi- nese Text --- p.33
Chapter 3.2.2 --- Word Segmentation for Word Units Extraction --- p.35
Chapter 3.2.3 --- Part-of-speech Tagging for Candidate Words Ex- traction --- p.36
Chapter 3.3 --- Chapter Summary --- p.37
Chapter 4 --- Indication of Lexical Cohesiveness by Lexical Chains --- p.39
Chapter 4.1 --- Lexical Chain as a Representation of Cohesiveness --- p.40
Chapter 4.1.1 --- Choice of Word Relations for Lexical Chaining --- p.41
Chapter 4.1.2 --- Lexical Chaining by Connecting Repeated Lexi- cal Elements --- p.43
Chapter 4.2 --- Lexical Chain as an Indicator of Story Segments --- p.48
Chapter 4.2.1 --- Indicators of Absence of Cohesiveness --- p.49
Chapter 4.2.2 --- Indicator of Continuation of Cohesiveness --- p.58
Chapter 4.3 --- Chapter Summary --- p.62
Chapter 5 --- Indication of Story Boundaries by Lexical Chains --- p.63
Chapter 5.1 --- Formal Definition of the Classification Procedures --- p.64
Chapter 5.2 --- Theoretical Framework for Segmentation Based on Lex- ical Chaining --- p.65
Chapter 5.2.1 --- Evaluation of Story Segmentation Accuracy --- p.65
Chapter 5.2.2 --- Previous Approach of Story Segmentation Based on Lexical Chaining --- p.66
Chapter 5.2.3 --- Statistical Framework for Story Segmentation based on Lexical Chaining --- p.69
Chapter 5.2.4 --- Post Processing of Ratio for Boundary Identifi- cation --- p.73
Chapter 5.3 --- Comparing Segmentation Models --- p.75
Chapter 5.4 --- Chapter Summary --- p.79
Chapter 6 --- Analysis of Lexical Chains Features as Boundary Indi- cators --- p.80
Chapter 6.1 --- Error Analysis --- p.81
Chapter 6.2 --- Window Length in the LRT Model --- p.82
Chapter 6.3 --- The Relative Importance of Each Set of Features --- p.84
Chapter 6.4 --- The Effect of Removing Timing Information --- p.92
Chapter 6.5 --- Chapter Summary --- p.96
Chapter 7 --- Conclusions and Future Work --- p.98
Chapter 7.1 --- Contributions --- p.98
Chapter 7.2 --- Future Works --- p.100
Chapter 7.2.1 --- Further Extension of the Framework --- p.100
Chapter 7.2.2 --- Wider Applications of the Framework --- p.105
Bibliography --- p.106
Cerdeirinha, João Manuel Macedo. "Recuperação de imagens digitais com base no conteúdo: estudo na Biblioteca de Arte e Arquivos da Fundação Calouste Gulbenkian." Master's thesis, 2019. http://hdl.handle.net/10362/91474.
Full textThe massive growth of multimedia data on the Internet and the emergence of new sharing platforms created major challenges for information retrieval. The limitations of text-based searches for this type of content have led to the development of a content-based information retrieval approach that has received increasing attention in recent decades. Taking into account the research carried out in this area, and digital images being the focus of this research, concepts and techniques associated with this approach are explored through a theoretical survey that reports the evolution of information retrieval and the importance that this subject has for Information Management and Curation. In the context of the systems that have been developed using automatic indexing, the various applications of this type of process are indicated. Available CBIR tools are also identified for a case study of the application of this type of image retrieval in the context of the Art Library and Archives of the Calouste Gulbenkian Foundation and the photographic collections that it holds in its resources, considering the particularities of the institution to which they belong. For the intended demonstration and according to the established criteria, online CBIR tools were initially used and, in the following phase, locally installed software was selected to search and retrieve in a specific collection. Through this case study, the strengths and weaknesses of content-based image retrieval are attested against the more traditional approach based on textual metadata currently in use in these collections. Taking into consideration the needs of users of the systems in which these digital objects are indexed, combining these techniques may lead to more satisfactory results.
Moreira, Walter. "Biblioteca tradicional x biblioteca virtual: modelos de recuperação da informação." Thesis, 1998. http://eprints.rclis.org/8353/1/BibliotecaTradicionalXBibliotecaVirtual_ModelosDeRecuperacaoDaInformacao.pdf.
Full textWächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, 2010. https://tud.qucosa.de/id/qucosa%3A25496.
Full text