Dissertations / Theses on the topic 'Text retrieval'

To see the other types of publications on this topic, follow the link: Text retrieval.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 dissertations / theses for your research on the topic 'Text retrieval.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Kay, Roderick Neil. "Text analysis, summarising and retrieval." Thesis, University of Salford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360435.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Lee, Hyo Sook. "Automatic text processing for Korean language free text retrieval." Thesis, University of Sheffield, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.322916.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Asian, Jelita. "Effective Techniques for Indonesian Text Retrieval." RMIT University. Computer Science and Information Technology, 2007. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080110.084651.

Full text
Abstract:
The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on finding information has focused on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to the search of Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes less than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use n-grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of target documents. We also address the problem of automatic creation of parallel corpora --- collections of documents that are direct translations of each other --- which are essential for cross-lingual information retrieval tasks. Well-curated parallel corpora are rare, and for many languages, such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure to discriminate good matches of parallel documents from bad matches, by approximately ten percentage points. We also investigate the applicability of our identification algorithms to other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
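The abstract's largest reported gain comes from tokenising words into grams of size five. As a point of reference, here is a minimal sketch of overlapping character 5-gram tokenisation in Python; whether the thesis uses overlapping windows exactly like this is an assumption, not something the abstract states.

```python
def char_ngrams(word, n=5):
    """Overlapping character n-grams; n = 5 is the size the abstract reports
    as most effective for Indonesian. Words shorter than n are kept whole."""
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("pembangunan"))
# ['pemba', 'emban', 'mbang', 'bangu', 'angun', 'nguna', 'gunan']
```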
APA, Harvard, Vancouver, ISO, and other styles
4

Shokouhi, Milad. "Federated Text Retrieval from Independent Collections." RMIT University. Computer Science and Information Technology, 2008. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080521.151632.

Full text
Abstract:
Federated information retrieval is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by the selected collections are integrated and merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot index uncrawlable hidden web collections; federated information retrieval systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, a subset of collections that are most likely to return relevant documents is selected; this creates the collection selection problem. To be able to select suitable collections, federated information retrieval systems acquire some knowledge about the contents of each collection, creating the collection representation problem. The results returned from the selected collections are merged before the final presentation to the user; this final step is the result merging problem. In this thesis, we propose new approaches for each of these problems. Our suggested methods, for collection representation, collection selection, and result merging, outperform state-of-the-art techniques in most cases. We also propose novel methods for estimating the number of documents in collections, and for pruning unnecessary information from collection representation sets. Although management of document duplication has been cited as one of the major problems in federated search, prior research in this area often assumes that collections are free of overlap. We investigate the effectiveness of federated search on overlapped collections, and propose new methods for maximizing the number of distinct relevant documents in the final merged results. In summary, this thesis introduces several new contributions to the field of federated information retrieval, including practical solutions to some historically unsolved problems in federated search, such as document duplication management. We test our techniques on multiple testbeds that simulate both hidden web and enterprise search environments.
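Of the three problems listed above, result merging is the simplest to illustrate in isolation. The sketch below merges per-collection rankings after min-max normalising their raw scores; this is only a common baseline for the merging step, not one of the methods actually proposed in the thesis, and the collection names and scores are invented.

```python
def merge_results(result_lists, top_k=10):
    """Merge per-collection rankings by min-max normalising their raw scores.
    result_lists: {collection_name: [(doc_id, raw_score), ...]}"""
    merged = []
    for name, results in result_lists.items():
        if not results:
            continue
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        for doc_id, score in results:
            norm = (score - lo) / (hi - lo) if hi > lo else 0.5
            merged.append((norm, f"{name}:{doc_id}"))
    merged.sort(reverse=True)
    return merged[:top_k]

print(merge_results({
    "collection_a": [("d1", 12.0), ("d2", 7.5)],
    "collection_b": [("d9", 0.81), ("d4", 0.44)],
}, top_k=3))
```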
APA, Harvard, Vancouver, ISO, and other styles
5

Nwesri, Abdusalam F. Ahmad. "Effective retrieval techniques for Arabic text." RMIT University. Computer Science and IT, 2008. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20081204.163422.

Full text
Abstract:
Arabic is a major international language, spoken in more than 23 countries, and the lingua franca of the Islamic world. The number of Arabic-speaking Internet users in the Middle East grew more than nine-fold between 2000 and 2007, yet research in Arabic Information Retrieval (AIR) has not advanced as it has for other languages such as English. In this thesis, we explore techniques that improve the performance of AIR systems. Stemming is considered one of the most important factors in improving the retrieval effectiveness of AIR systems. Most current stemmers remove affixes without checking whether the removed letters are actually affixes. We propose lexicon-based improvements to light stemming that distinguish core letters from proper Arabic affixes. We devise rules to stem most affixes and show their effects on retrieval effectiveness. Using the TREC 2001 test collection, we show that applying relevance feedback with our rules produces significantly better results than light stemming. Techniques for Arabic information retrieval have been studied in depth on clean collections of newswire dispatches. However, the effectiveness of such techniques is not known on other, noisy collections in which text is generated using automatic speech recognition (ASR) systems and queries are generated using machine translation (MT). Using noisy collections, we show that normalisation, stopping and light stemming improve results as in normal text collections, but that n-grams and root stemming decrease performance. Most recent AIR research has been undertaken using collections that are far smaller than the collections used for English text retrieval; consequently, the significance of some published results is debatable. Using the LDC Arabic GigaWord collection, which contains more than 1 500 000 documents, we create a test collection of 90 topics with their relevance judgements. Using this test collection, we show empirically that for a large collection, root stemming is not competitive. Of the approaches we have studied, lexicon-based stemming approaches perform better than light stemming approaches alone. Arabic text commonly includes foreign words transliterated into Arabic characters. Several transliterated forms may be in common use for a single foreign word, but users rarely use more than one variant during search tasks. We test the effectiveness of lexicons, Arabic patterns, and n-grams in distinguishing foreign words from native Arabic words. We introduce rules that help filter foreign words and improve the n-gram approach used in language identification. Our combined n-grams and lexicon approach successfully identifies 80% of all foreign words with a precision of 93%. To find variants of a specific foreign word, we apply phonetic and string similarity techniques and introduce novel algorithms to normalise them in Arabic text. We modify phonetic techniques used for English to suit the Arabic language, and compare several techniques to determine their effectiveness in finding foreign word variants. We show that our algorithms significantly improve recall. We also show that expanding queries using variants identified by our Soutex4 phonetic algorithm results in a significant improvement in precision and recall. Together, the approaches described in this thesis represent an important step towards realising highly effective retrieval of Arabic text.
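To make the distinction between light stemming and the thesis's lexicon-based rules concrete: a light stemmer simply strips frequent affixes. The sketch below follows the spirit of Light10-style affix lists and is not the thesis's algorithm; in particular it omits the lexicon checks the author adds to avoid stripping core letters that merely look like affixes.

```python
# Common Arabic prefixes and suffixes, roughly in the spirit of Light10;
# longest affixes are listed first so they are tried first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word):
    """Strip one common prefix and one common suffix, keeping at least
    three letters; no lexicon check is performed in this sketch."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("المكتبات"))   # strips the article and plural suffix -> مكتب
```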
APA, Harvard, Vancouver, ISO, and other styles
6

De Luca, Ernesto William. "Semantic support in multilingual text retrieval." Aachen : Shaker, 2008. http://d-nb.info/990194914/04.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Viana, Hugo Henrique Amorim. "Automatic information retrieval through text-mining." Master's thesis, Faculdade de Ciências e Tecnologia, 2013. http://hdl.handle.net/10362/11308.

Full text
Abstract:
The dissertation presented for obtaining the Master’s Degree in Electrical Engineering and Computer Science, at Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia
Nowadays, a large share of firms in the European Union are catalogued as Small and Medium Enterprises (SMEs), and they employ a great portion of the active workforce in Europe. Nonetheless, SMEs cannot afford to implement methods or tools to systematically adopt innovation as part of their business process. Innovation is the engine of competitiveness in the globalized environment, especially in the current socio-economic situation. This thesis provides a platform that, when integrated with the ExtremeFactories (EF) project, helps SMEs become more competitive by means of a monitoring schedule functionality. A text-mining platform that is able to schedule the gathering of information through keywords is presented. Several implementation choices had to be made in developing the platform; the one that deserves particular emphasis is the framework, Apache Lucene Core 2, which supplies an efficient text-mining engine and is used extensively for the purposes of the thesis.
APA, Harvard, Vancouver, ISO, and other styles
8

Krishnan, Sharenya. "Text-Based Information Retrieval Using Relevance Feedback." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-53603.

Full text
Abstract:
Europeana, a freely accessible digital library intended to make Europe's cultural and scientific heritage available to the public, was founded by the European Commission in 2008. The goal was to deliver semantically enriched digital content with multilingual access. Even though the amount of content grew, the portal increasingly faced the problem of retrieving information held in unstructured form. To complement the Europeana portal services, ASSETS (Advanced Search Service and Enhanced Technological Solutions) was introduced with services that sought to improve the usability and accessibility of Europeana. My contribution is to study different text-based information retrieval models and their relevance feedback techniques, and to implement one simple model. The thesis gives a detailed overview of the information retrieval process along with the implementation of the chosen strategy for relevance feedback, which generates automatic query expansion. Finally, the thesis concludes with an analysis of the results obtained using relevance feedback, a discussion of the implemented model, and an assessment of its future use, both as a continuation of my work and within ASSETS.
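A common way to turn relevance judgements into automatic query expansion is Rocchio's formula, which moves the query vector towards relevant documents and away from non-relevant ones. The sketch below is this generic formulation over tf-idf vectors with conventional parameter values; it is not necessarily the "one simple model" implemented for ASSETS, and the toy vocabulary and vectors are invented.

```python
import numpy as np

def rocchio_expand(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query expansion over tf-idf vectors: shift the query towards the
    centroid of relevant documents and away from the non-relevant centroid."""
    rel = np.mean(rel_docs, axis=0) if len(rel_docs) else 0.0
    nonrel = np.mean(nonrel_docs, axis=0) if len(nonrel_docs) else 0.0
    expanded = alpha * query_vec + beta * rel - gamma * nonrel
    return np.maximum(expanded, 0.0)   # negative term weights are usually clipped

# toy 4-term vocabulary: [europeana, heritage, painting, football]
q = np.array([1.0, 0.0, 0.0, 0.0])
relevant = np.array([[0.8, 0.6, 0.4, 0.0], [0.7, 0.5, 0.0, 0.0]])
nonrelevant = np.array([[0.1, 0.0, 0.0, 0.9]])
print(rocchio_expand(q, relevant, nonrelevant))
```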
APA, Harvard, Vancouver, ISO, and other styles
9

Westmacott, Mike. "Content based image retrieval : analogies with text." Thesis, University of Southampton, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.423038.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Murad, Masrah Azrifah Azmi. "Fuzzy text mining for intelligent information retrieval." Thesis, University of Bristol, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.416830.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Isbell, Charles L. (Charles Lee). "Sparse multi-level representations for text retrieval." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/47513.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.
Includes bibliographical references (p. [153]-160).
by Charles Lee Isbell, Junior.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
12

Kyriakides, Alexandros 1977. "Supervised information retrieval for text and images." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/28426.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (leaves 73-74).
We present a novel approach to choosing an appropriate image for a news story. Our method uses the caption of the image to retrieve a suitable image. We have developed a word-extraction engine called WordEx. WordEx uses supervised learning to predict which words in the text of a news story are likely to be present in the caption of an appropriate image. The words extracted by WordEx are then used to retrieve the image from a collection of images. On average, the number of words extracted by WordEx is 10% of the original story text. Therefore, this word-extraction engine can also be applied to text documents for feature reduction.
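The abstract does not spell out WordEx's features or learner, so the sketch below is purely illustrative: a supervised word extractor could be trained by labelling each story word with whether it appears in the chosen image's caption, then fitting an off-the-shelf classifier over shallow word features. The feature names and the choice of logistic regression are assumptions, not details from the thesis.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def word_features(word, position, story_words):
    """Shallow per-word features for deciding whether a story word is likely
    to appear in the caption of a suitable image (illustrative feature set)."""
    return {
        "lower": word.lower(),
        "is_capitalised": word[:1].isupper(),
        "relative_position": position / max(len(story_words), 1),
        "story_frequency": story_words.count(word) / max(len(story_words), 1),
        "length": len(word),
    }

# Training data: one feature dict per story word, label 1 if that word occurs
# in the caption of the image chosen for the story, 0 otherwise.
vectoriser = DictVectorizer()
extractor = LogisticRegression(max_iter=1000)
# X = vectoriser.fit_transform(feature_dicts); extractor.fit(X, labels)
# At prediction time, the highest-scoring words form the image-retrieval query.
```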
by Alexandros Kyriakides.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
13

Landrin-Schweitzer, Yann. "Algorithmes génétiques interactifs pour le text-retrieval." Paris 11, 2003. http://www.theses.fr/2003PA112303.

Full text
Abstract:
L'inflation des quantités de documents électroniques, sur internet et dans les intranets d'entreprises a entraîné au cours des années 1990-2000 un important développement des moteurs de recherche textuelle (text-retrieval). Leurs performances actuelles élevées, reposant sur l'emploi d'outils linguistiques et sémantiques très spécialisés, se heurtent à d'ultimes barrières: les particularités individuelles des utilisateurs, et l'important effort qu'ils doivent fournir pour interpréter les informations reçues. Les approches statistiques, reposant sur des modèles cognitifs, ont prouvé leur efficacité dans des situations au contexte sémantique simple. Nous avons abordé cette question en développant la spécificité du comportement des outils pour chaque utilisateur. A défaut de modèles cognitifs satisfaisants pour tous les types d'utilisateurs, permettant de contraindre les types de réponse acceptables pour chaque requête, nous avons formé un modèle de prétraitement de requêtes réalisable pour obtenir ces réponses. Le traitement à effectuer est contenu dans un profil utilisateur. Ce profil est adapté dynamiquement aux comportements de l'utilisateur grâce à un algorithme évolutionnaire, maximisant une évaluation de satisfaction dans les résultats produits. L'approche de programmation génétique utilisée pour cette optimisation repose sur une approche parisienne, optimisant une population de modules. Ceux-ci sont les composants élémentaires de règles de transformation, permettant de réécrire la requête de l'utilisateur. Après ce traitement, un composant de recherche d'un système d'extraction textuel commercial permet l'obtention des listes de résultats, de manière invisible pour l'utilisateur. Un prototype fonctionnel, Elise, a été développé. Si la performance de celui-ci, liée aux opinions des utilisateurs, est d'évaluation délicate, les résultats obtenus montrent des capacités d'adaptation et de créativité absentes des systèmes traditionnels
The number and volume of documents available in electronic form skyrocketed during the 1990s. One consequence is the development of archiving and management tools for electronic documents. Among those, textual search engines have taken a major role in the treatment and diffusion of information. These tools now achieve very high performance, based on specialised linguistic resources. However, they are reaching new limits: the particularities of their users, and the complexity of information processing. Statistical approaches, based on cognitive user models, have proven themselves in simple semantic contexts. They still fail to endow textual extraction tools with the capacity for user specificity and adaptability. We attempt to overcome this limitation by specialising the behaviour of text-retrieval tools to the specificities of their users. Lacking an appropriate cognitive model applicable to all users, one that would let us constrain the answers that should be given to each request, we instead propose a model of the treatment applied to the requests themselves. We dynamically adapt a profile containing this information with an evolutionary algorithm that maximises the user's satisfaction with the results obtained. Applying the Parisian approach to this genetic programming core leads to optimising a population of modules, the elementary components of transformation rules. Actual result lists are obtained through a classical text extraction tool, invisibly to the user. A working prototype, Elise, has been implemented. Evaluating its performance, which depends on the opinions of users, is tricky, but the tests show that Elise is capable of adaptation and creativity of which traditional systems are incapable.
APA, Harvard, Vancouver, ISO, and other styles
14

Sussna, Michael John. "Text retrieval using inference in semantic metanetworks /." Diss., 1997. http://wwwlib.umi.com/cr/ucsd/fullcit?p9726031.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Vinay, V. "The relevance of feedback for text retrieval." Thesis, University College London (University of London), 2007. http://discovery.ucl.ac.uk/1446146/.

Full text
Abstract:
Relevance Feedback is a technique that helps an Information Retrieval system modify a query in response to relevance judgements provided by the user about individual results displayed after an initial retrieval. This thesis begins by proposing an evaluation framework for measuring the effectiveness of feedback algorithms. The simulation-based method involves a brute force exploration of the outcome of every possible user action. Starting from an initial state, each available alternative is represented as a traversal along one branch of a user decision tree. The use of the framework is illustrated in two situations---searching on devices with small displays and web search. Three well known RF algorithms, Rocchio, Robertson/Sparck-Jones (RSJ) and Bayesian, are compared for these applications. For small display devices, the algorithms are evaluated in conjunction with two strategies for presenting search results: the top-D ranked documents and a document ranking that attempts to maximise information gain from the user's choices. Experimental results indicate that for RSJ feedback, which involves an explicit feature selection policy, the greedy top-D display is more appropriate. For the other two algorithms, the exploratory display that maximises information gain produces better results. A user study was conducted to evaluate the performance of the relevance feedback methods with real users and compare the results with the findings from the tree analysis. This comparison between the simulations and real user behaviour indicates that the Bayesian algorithm, coupled with the sampled display, is the most effective. For web search, two possible representations for web pages are considered---the textual content of the page and the anchor text of hyperlinks into the page. Results indicate that there is a significant variation in the upper-bound performance of the three RF algorithms and that the Bayesian algorithm approaches the best possible. The relative performance of the three algorithms differed in the two sets of experiments. All other factors being constant, this difference in effectiveness was attributed to the fact that the datasets used in the two cases were different. Also, at a more general level, a relationship was observed between the performance of the original query and the benefits of subsequent relevance feedback. The remainder of the thesis looks at properties that characterise sets of documents, with the particular aim of identifying measures that are predictive of the future performance of statistical algorithms on these document sets. The central hypothesis is that a set of points (corresponding to documents) is difficult if it lacks structure. Three properties are identified---the clustering tendency, sensitivity to perturbation and the local intrinsic dimensionality. The clustering tendency reflects the presence or absence of natural groupings within the data. Perturbation analysis looks at the sensitivity of the similarity metric to small changes in the input. The correlation present in sets of points is measured by the local intrinsic dimensionality, thereby indicating the randomness present in them. These properties are shown to be useful for two tasks, namely measuring the complexity of text datasets and query performance prediction.
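Of the three dataset properties named above, clustering tendency is the most standard to compute. One widely used estimate is the Hopkins statistic, sketched below; the thesis's exact formulations of clustering tendency, perturbation sensitivity and local intrinsic dimensionality may differ from this.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_size=50, seed=0):
    """Hopkins statistic: close to 0.5 for structureless (uniform) data,
    close to 1 for data with strong natural groupings."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(sample_size, n - 1)

    # nearest real point to m synthetic points drawn uniformly over the data range
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u_dist = NearestNeighbors(n_neighbors=1).fit(X).kneighbors(uniform)[0].ravel()

    # nearest *other* real point to m sampled real points
    idx = rng.choice(n, size=m, replace=False)
    w_dist = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X[idx])[0][:, 1]

    return u_dist.sum() / (u_dist.sum() + w_dist.sum())

X = np.vstack([np.random.randn(100, 10), np.random.randn(100, 10) + 5.0])
print(hopkins_statistic(X))   # clearly clustered data: value well above 0.5
```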
APA, Harvard, Vancouver, ISO, and other styles
16

Holmes-Higgin, Paul. "Text knowledge : the Quirk Experiments." Thesis, University of Surrey, 1995. http://epubs.surrey.ac.uk/842732/.

Full text
Abstract:
Our research examines text knowledge: the knowledge encoded in text and the knowledge about a text. We approach text knowledge from different perspectives, describing the theories and techniques that have been applied to extracting, representing and deploying this knowledge, and propose some novel techniques that may enhance the understanding of text knowledge. These techniques include the concept of virtual corpus hierarchies, hybrid symbolic and connectionist representation and reasoning, text analysis and self-organising corpora. We present these techniques in a framework that embraces the different facets of text knowledge as a whole, be it corpus organisation and text identification, text analysis, or knowledge representation and reasoning. This framework comprises three phases, that of organisation, analysis and evaluation of text, where a single text might be a complete work, a technical term, or even a single letter. The techniques proposed are demonstrated by implementations of computer systems and some experiments based on these implementations: the Quirk Experiments. Through these experiments we show how the highly interconnected nature of text knowledge can be reduced or abstracted for specific purposes, from a range of techniques based on explicit symbolic representations and self-organising connectionist schemes.
APA, Harvard, Vancouver, ISO, and other styles
17

Song, Min. "Robust knowledge extraction over large text collections /." Philadelphia, Pa. : Drexel University, 2005. http://dspace.library.drexel.edu/handle/1860/495.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Caviglia, Karen. "Signature file access methodologies for text retrieval : a literature review with additional test cases /." Online version of thesis, 1987. http://hdl.handle.net/1850/10144.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Estall, Craig. "A study in distributed document retrieval." Thesis, Queen's University Belfast, 1985. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.328342.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Käter, Thorsten. "Evaluierung des Text-Retrievalsystems "Intelligent Miner for Text" von IBM : eine Studie im Vergleich zur Evaluierung anderer Systeme /." [S.l. : s.n.], 1999. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB8230685.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Kim, Eungi. "Implications of Punctuation Mark Normalization on Text Retrieval." Thesis, University of North Texas, 2013. https://digital.library.unt.edu/ark:/67531/metadc500160/.

Full text
Abstract:
This research investigated issues related to normalizing punctuation marks from a text retrieval perspective. A punctuation-centric approach was undertaken by exploring changes in meanings, whitespace, word retrievability, and other issues related to normalizing punctuation marks. To investigate punctuation normalization issues, various frequency counts of punctuation marks and punctuation patterns were conducted using text drawn from the Gutenberg Project archive and the Usenet Newsgroup archive. A number of useful punctuation mark types that could aid in analyzing punctuation marks were discovered. This study identified two types of punctuation normalization procedures: (1) lexical independent (LI) punctuation normalization and (2) lexical oriented (LO) punctuation normalization. Using these two types of punctuation normalization procedures, this study discovered various effects of punctuation normalization in terms of different search query types. By analyzing the punctuation normalization problem in this manner, a wide range of issues were discovered, such as the need to define different types of searching, to disambiguate the role of punctuation marks, to normalize whitespace, and to index punctuated terms. This study concluded that to achieve the most positive effect in a text retrieval environment, normalizing punctuation marks should be based on an extensive systematic analysis of punctuation marks and punctuation patterns and their related factors. The results of this study indicate that there were many challenges due to the complexity of language. Further, this study recommends avoiding a simplistic approach to punctuation normalization.
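To illustrate the two procedure types named in the abstract, the sketch below contrasts a lexical-independent rule (treat every mark the same way) with a lexical-oriented rule (keep marks that belong to a lexical unit such as don't, e-mail or U.S.A.). The specific regular expressions are illustrative only and are much cruder than the analysis carried out in the thesis.

```python
import re

def normalize_li(text):
    """Lexical-independent (LI): one uniform rule for every mark,
    here replacing each punctuation character with a space."""
    return re.sub(r"[^\w\s]", " ", text)

def normalize_lo(text):
    """Lexical-oriented (LO): keep marks that are part of a lexical unit
    (don't, e-mail, U.S.A.) and space out the rest; illustrative rules only."""
    text = re.sub(r"(?<!\w)[.'\-]|[.'\-](?!\w)", " ", text)  # word-external . ' -
    return re.sub(r"[^\w\s.'\-]", " ", text)                 # all other marks

print(normalize_li("U.S.A.-based, isn't it?"))  # 'U S A  based  isn t it '
print(normalize_lo("U.S.A.-based, isn't it?"))  # 'U.S.A  based  isn't it '
```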
APA, Harvard, Vancouver, ISO, and other styles
22

Zhang, Nan. "TRANSFORM BASED AND SEARCH AWARE TEXT COMPRESSION SCHEMES AND COMPRESSED DOMAIN TEXT RETRIEVAL." Doctoral diss., University of Central Florida, 2005. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/3938.

Full text
Abstract:
In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to the other on data communications links with limited bandwidth. We must also have means to speedily find the information we need from this huge mass of data. Sometimes, a single site may also contain large collections of data such as a library database, thereby requiring an efficient search mechanism even to search within the local data. To facilitate information retrieval, an emerging ad hoc standard for uncompressed text is XML, which preprocesses the text by adding user-defined metadata such as DTDs or hyperlinks to enable searching with better efficiency and effectiveness. This increases the file size considerably, underscoring the importance of applying text compression. On account of efficiency (in terms of both space and time), there is a need to keep the data in compressed form for as long as possible. Text compression is concerned with techniques for representing digital text data in alternate representations that take less space. Not only does it help conserve storage space for archival and online data, it also helps system performance by requiring fewer secondary storage (disk or CD-ROM) accesses, and it improves network bandwidth utilization by reducing transmission time. Unlike static images or video, there is no international standard for text compression, although compressed formats like .zip, .gz and .Z files are increasingly being used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. Although used primarily for text data, lossless compression algorithms are useful in special classes of images such as medical imaging, fingerprint data, astronomical images and databases containing mostly vital numerical data, tables and text information. Many lossy algorithms use lossless methods at the final stage of encoding, underscoring the importance of lossless methods for both lossy and lossless compression applications. In order to effectively utilize the full potential of compression techniques for future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression, or only with partial decompression, independent of whether the search is done on the text or on some inversion table corresponding to a set of key words for the text. In this dissertation, we make the following contributions: (1) Star family compression algorithms: We have proposed an approach to develop a reversible transformation that can be applied to a source text and that improves existing algorithms' ability to compress. We use a static dictionary to convert the English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text. Thus we achieve some compression at the preprocessing stage. We have a series of transforms which improve the performance. The Star transform requires a static dictionary of a certain size.
To avoid the considerable complexity of conversion, we employ the ternary tree data structure that efficiently converts the words in the text to the words in the star dictionary in linear time. (2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: We propose a method to extract the useful context information in linear time from the BWT transformed text. The auxiliary arrays obtained from the BWT inverse transform bring logarithmic search time. Meanwhile, approximate pattern matching can be performed based on the results of exact pattern matching to extract possible candidates for the approximate matches. A fast verification algorithm can then be applied to those candidates, which may be just small parts of the original text. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT compressed text. A typical compression system based on BWT has Move-to-Front and Huffman coding stages after the transformation. We propose a novel approach to replace the Move-to-Front stage in order to extend compressed domain search capability all the way to the entropy coding stage. A modification to Move-to-Front makes it possible to randomly access any part of the compressed text without referring to the part before the access point. (3) Modified LZW algorithm that allows random access and partial decoding for compressed text retrieval: Although many compression algorithms provide good compression ratio and/or time complexity, LZW was the first studied for compressed pattern matching because of its simplicity and efficiency. Modifications to the LZW algorithm provide the extra advantage of fast random access and partial decoding, which is especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that text search can be performed at the expected level of granularity. For example, the user can choose to retrieve a single line, a paragraph, or a file, etc. that contains the keywords. More importantly, we show that parallel encoding and decoding are straightforward with the modified LZW. Both encoding and decoding can be performed easily with multiple processors, and the encoding and decoding processes are independent with respect to the number of processors.
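Point (2) concerns searching text that has been Burrows-Wheeler transformed. For orientation, the sketch below builds a BWT by sorting rotations and counts pattern occurrences with standard FM-index backward search; it illustrates the general idea of compressed-domain matching rather than the linear-time context extraction or the k-mismatch algorithms contributed by the thesis.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (didactic, O(n^2 log n))."""
    s = text + "\0"                                   # sentinel: unique, smallest
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[(i - 1) % len(s)] for i in order)

def count_occurrences(bwt_text, pattern):
    """Count matches of `pattern` using FM-index backward search."""
    counts = Counter(bwt_text)
    C, total = {}, 0                                  # C[c] = #chars smaller than c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt_text) + 1) for c in counts}   # prefix counts per char
    for i, ch in enumerate(bwt_text):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    lo, hi = 0, len(bwt_text)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo, hi = C[c] + occ[c][lo], C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

print(count_occurrences(bwt("banana"), "ana"))   # -> 2
```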
Ph.D.
School of Computer Science
Engineering and Computer Science
Computer Science
APA, Harvard, Vancouver, ISO, and other styles
23

DeLuca, Ernesto W. [Verfasser]. "Semantic Support in Multilingual Text Retrieval / Ernesto W DeLuca." Aachen : Shaker, 2008. http://d-nb.info/1161313745/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

De Luca, Ernesto William [Verfasser]. "Semantic Support in Multilingual Text Retrieval / Ernesto W DeLuca." Aachen : Shaker, 2008. http://nbn-resolving.de/urn:nbn:de:101:1-2018061708435800872248.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Brucato, Matteo. "Temporal Information Retrieval." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/5690/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Al, Tayyar Musaid Seleh. "Arabic information retrieval system based on morphological analysis (AIRSMA) : a comparative study of word, stem, root and morpho-semantic methods." Thesis, De Montfort University, 2000. http://hdl.handle.net/2086/4126.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Hou, Jun. "Text mining with semantic annotation : using enriched text representation for entity-oriented retrieval, semantic relation identification and text clustering." Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/79206/1/Jun_Hou_Thesis.pdf.

Full text
Abstract:
This project is a step forward in the study of text mining, where enhanced text representation with semantic information plays a significant role. It develops effective methods for entity-oriented retrieval, semantic relation identification and text clustering utilizing semantically annotated data. These methods are based on enriched text representations generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several state-of-the-art benchmark methods on real-life datasets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles the clustering of documents with multiple feature spaces.
APA, Harvard, Vancouver, ISO, and other styles
28

Ball, Liezl Hilde. "Enhancing digital text collections with detailed metadata to improve retrieval." Thesis, University of Pretoria, 2020. http://hdl.handle.net/2263/79015.

Full text
Abstract:
Digital text collections are increasingly important, as they enable researchers to explore new ways of interacting with texts through the use of technology. Various tools have been developed to facilitate exploring and searching in text collections at a fairly low level of granularity. Ideally, it should be possible to filter the results at a greater level of granularity to retrieve only the specific instances in which the researcher is interested. The aim of this study was to investigate to what extent detailed metadata could be used to enhance texts in order to improve retrieval. To do this, the researcher had to identify metadata that could be useful for filtering and find ways in which these metadata can be applied to or encoded in texts. The researcher also had to evaluate existing tools to determine to what extent current tools support retrieval on a fine-grained level. After identifying useful metadata and reviewing existing tools, the researcher could suggest a metadata framework that could be used to encode texts on a detailed level. Metadata in five different categories were used, namely morphological, syntactic, semantic, functional and bibliographic. A further contribution of this metadata framework was the addition of in-text bibliographic metadata, for use where sections in a text have different properties from the main text. The suggested framework had to be tested to determine if retrieval was indeed improved. In order to do so, a selection of texts was encoded with the suggested framework and a prototype was developed to test the retrieval. The prototype receives the encoded texts and stores the information in a database. A graphical user interface was developed to enable searching in the database in an easy and intuitive manner. The prototype demonstrates that it is possible to search for words or phrases with specific properties when detailed metadata are applied to texts. The fine-grained metadata from five different categories enable retrieval on a greater level of granularity and specificity. It is therefore recommended that detailed metadata are used to encode texts in order to improve retrieval in digital text collections. Keywords: metadata, digital humanities, digital text collections, retrieval, encoding
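As a toy illustration of what retrieval over such fine-grained metadata could look like, the sketch below attaches metadata from the five categories to individual tokens and filters on them; the field names, values and flat dictionary encoding are hypothetical and do not reflect the thesis's actual framework or prototype.

```python
# Each token carries metadata in the five categories the thesis identifies
# (morphological, syntactic, semantic, functional, bibliographic).
tokens = [
    {"word": "bank", "lemma": "bank", "pos": "NOUN", "sense": "financial_institution",
     "function": "body_text", "work": "Example Novel", "year": 1854},
    {"word": "bank", "lemma": "bank", "pos": "NOUN", "sense": "river_side",
     "function": "footnote", "work": "Example Novel", "year": 1854},
]

def search(tokens, **filters):
    """Return tokens matching every metadata filter, enabling queries such as
    'the word bank in the river_side sense, but only inside footnotes'."""
    return [t for t in tokens if all(t.get(k) == v for k, v in filters.items())]

print(search(tokens, word="bank", sense="river_side", function="footnote"))
```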
Thesis (DPhil (Information Science))--University of Pretoria, 2020.
Information Science
DPhil (Information Science)
Unrestricted
APA, Harvard, Vancouver, ISO, and other styles
29

Zhou, Xiaohua. "Semantics-based language models for information retrieval and text mining /." Philadelphia, Pa. : Drexel University, 2008. http://hdl.handle.net/1860/2931.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Mick, Alan A. "Knowledge based text indexing and retrieval utilizing case based reasoning /." Online version of thesis, 1994. http://hdl.handle.net/1850/11715.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Dick, Judith P. "A conceptual, case-relation representation of text for intelligent retrieval." Ottawa : National Library of Canada = Bibliothèque nationale du Canada, 1992. http://books.google.com/books?id=Zh3hAAAAMAAJ.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Greco, Luca. "Text retrieval and categorization through a weighted word pairs approach." Doctoral thesis, Universita degli studi di Salerno, 2013. http://hdl.handle.net/10556/981.

Full text
Abstract:
2011 - 2012
The focus of this dissertation is the development and validation of a novel method for supervised text classification to be used effectively when small training sets are available. The proposed approach, which relies on a Weighted Word Pairs (WWP) structure, has been validated in two application fields: Query Expansion and Text Categorization. By analyzing the state of the art for supervised text classification, it has been observed that existing methods show a drastic performance decrease when the number of training examples is reduced. This behaviour is essentially due to the following reasons: the use, common to most existing systems, of the "Bag of Words" model, where only the presence and occurrence of words in texts is considered, losing any information about position; the polysemy and ambiguity typical of natural language; and the performance degradation affecting classification systems when the number of features is much greater than the number of available training samples. Nevertheless, manual document classification is a boring, costly and slow process: it has been observed that only 100 documents can be hand-labeled in 90 minutes, and this number may not be sufficient for achieving good accuracy in real contexts with a standard trained classifier. On the other hand, in Query Expansion problems (in the domain of interactive web search engines), where the user is asked to provide relevance feedback to refine the search process, the number of selected documents is much smaller than the total number of indexed documents. Hence, there is great interest in alternative classification methods which, using more complex structures than a simple list of words, show higher efficiency when learning from a few training documents. The proposed approach is based on a hierarchical structure, called Weighted Word Pairs (WWP), that can be learned automatically from a corpus of documents and relies on two fundamental entities: aggregate roots, i.e. the words probabilistically most implied by all the others, and aggregates, which are words having a greater probabilistic correlation with aggregate roots. WWP structure learning takes place through three main phases. The first phase is characterized by the use of a probabilistic topic model, Latent Dirichlet Allocation (LDA), to compute the probability distribution of words within documents: in particular, the output of the LDA algorithm consists of two matrices that define the probabilistic relationships between words, topics and documents. Under suitable assumptions, the probability of the occurrence of each word in the corpus and the conditional and joint probabilities between word pairs can be derived from these matrices. During the second phase, aggregate roots (whose number is selected by the user as an external parameter) are chosen as those words that maximize the product of the conditional probabilities between a given word and all the others, in line with the definition given above. Once the aggregate roots have been chosen, each of them is associated with some aggregates, and the coefficient of relationship between aggregate roots and aggregates is calculated from the joint probability between word pairs (previously computed). The number of links between aggregate roots and aggregates depends on another external parameter (Max Pairs), which sets thresholds that filter out weakly correlated pairs.
The third phase is aimed at searching for the optimal WWP structure, which has to provide a synthetic representation of the information contained in all the documents (not only in a subset of them). The effectiveness of the WWP structure was initially assessed in Query Expansion problems, in the context of interactive search engines. In this scenario the user, after getting from the system a first ranking of documents in response to a specific query, is asked to select some relevant documents as feedback, according to his information need. From those documents (the relevance feedback), some key terms are extracted to expand the initial query and refine the search. In our case, a WWP structure is extracted from the relevance feedback and is appropriately translated into a query. The experimental phase for this application context was conducted with the TREC-8 standard dataset, which consists of approximately 520 thousand pre-classified documents. A performance comparison between the baseline (results obtained with no expanded query), the WWP structure and a query expansion method based on the Kullback-Leibler divergence was carried out. Typical information retrieval measurements were computed: precision at various levels, mean average precision, binary preference, and R-precision. The evaluation of these measurements was performed using the standard evaluation tool used for the TREC conferences. The results obtained are very encouraging. A further application field for validating the WWP structure is document categorization. In this case, a WWP structure combined with a standard Information Retrieval module is used to implement a document-ranking text classifier. Such a classifier makes a soft decision: it draws up a ranking of documents that requires the choice of an appropriate threshold (Categorization Status Value) in order to obtain a binary classification. In our case, this threshold was chosen by evaluating performance on a validation set in terms of micro-precision, micro-recall and micro-F1. The Reuters-21578 dataset, consisting of about 21 thousand newspaper articles, was used; in particular, evaluation was performed on the ModApte split (10 categories), which includes only manually classified documents. The experiment was carried out by randomly selecting 1% of the training set available for each category, and this selection was repeated 100 times so that the results were not biased by a specific subset. The performance, evaluated by calculating the F1 measure (harmonic mean of precision and recall), was compared with Support Vector Machines, referred to in the literature as the state of the art in the classification of this dataset. The results show that when the training set is reduced to 1%, the performance of the WWP-based classifier is on average higher than that of SVM. [edited by author]
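The derivation of word-pair probabilities from the two LDA matrices can be sketched as follows, assuming phi holds per-topic word distributions and theta per-document topic distributions, and assuming words are conditionally independent given the topic; the thesis's exact derivation, thresholds and optimisation of the WWP structure are not reproduced here.

```python
import numpy as np

def wwp_sketch(phi, theta, vocab, n_roots=2, max_pairs=3):
    """phi: (topics x words) word distributions; theta: (docs x topics) topic
    distributions, both as produced by an LDA implementation."""
    p_topic = theta.mean(axis=0)              # P(k), averaged over the corpus
    p_word = phi.T @ p_topic                  # P(w) = sum_k P(w|k) P(k)
    p_joint = (phi.T * p_topic) @ phi         # P(wi, wj), independence given topic
    p_cond = p_joint / p_word[None, :]        # column j holds P(wi | wj)

    # aggregate roots: words maximising the product of P(root | other) over all others
    scores = np.log(p_cond + 1e-12).sum(axis=1)
    roots = np.argsort(scores)[::-1][:n_roots]

    # aggregates: the most strongly (jointly) correlated words for each root
    pairs = []
    for r in roots:
        linked = [w for w in np.argsort(p_joint[r])[::-1] if w != r][:max_pairs]
        pairs += [(vocab[r], vocab[w], round(float(p_joint[r, w]), 4)) for w in linked]
    return pairs
```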
Il focus dell’attività di ricerca riguarda lo sviluppo e la validazione di una metodologia alternativa per la classificazione supervisionata di testi mediante impiego di training set di dimensioni ridotte (circa l’1% rispetto a quelli tipicamente impiegati). L’approccio proposto, che si basa su una struttura a coppie di parole pesate (Weighted Word Pairs), è stato validato su due contesti applicativi: Query Expansion e Text Categorization. Da un’accurata analisi dello stato dell’arte in materia di classificazione supervisionata dei testi, si è evinto come le metodologie esistenti mostrino un evidente calo di prestazioni in presenza di una riduzione degli esempi (campioni del data set già classificati) utilizzati per l’addestramento. Tale calo è essenzialmente attribuibile alle seguenti cause: l’impiego, comune a gran parte dei sistemi esistenti, del modello “Bag of Words” dove si tiene conto della sola presenza ed occorrenza delle singole parole nei testi, perdendo qualsiasi informazione circa la posizione; polisemia ed ambiguità tipiche del linguaggio naturale; il peggioramento delle prestazioni che coinvolge i sistemi di classificazione quando il numero di caratteristiche (features) impiegate è molto maggiore degli esempi disponibili per l’addestramento del sistema. Dal punto di vista delle applicazioni, ci si trova spesso di fronte a casi in cui, per la classificazione di un corpus di documenti, si ha a disposizione un insieme limitato di esempi: questo perché il processo di classificazione manuale dei documenti è oneroso e lento. D’altro canto in problemi di Query Expansion, nell’ambito dei motori di ricerca interattivi, dove l’utente è chiamato a fornire un feedback di rilevanza per raffinare il processo di ricerca, il numero di documenti selezionati è molto inferiore al totale dei documenti indicizzati dal motore. Da qui l’interesse verso strategie di classificazione che, usando strutture più complesse rispetto alla semplice lista di parole, mostrino un’efficienza maggiore quando la struttura è appresa da pochi documenti di training. L’approccio proposto si basa su una struttura gerarchica (Weighted Word Pairs) che può essere appresa automaticamente da un corpus di documenti e che è costituita da due entità fondamentali: i termini aggregatori che sono le parole probabilisticamente più implicate da tutte le altre; i termini aggregati che sono le parole aventi maggiore correlazione probabilistica con i termini aggregatori. L’apprendimento della struttura WWP avviene attraverso tre fasi principali: la prima fase è caratterizzata dall’impiego del topic model probabilistico e della Latent Dirichlet Allocation per il calcolo della distribuzione probabilistica delle parole all’interno dei documenti: in particolare, l’output dell’algoritmo LDA è costituito da due matrici che definiscono il legame probabilistico tra le parole, i topic e i documenti analizzati. Sotto opportune ipotesi è possibile derivare da tali matrici le probabilità associate al verificarsi delle singole parole all’interno del corpus e le probabilità condizionate e congiunte tra le coppie di parole; durante la seconda fase vengono scelti i termini aggregatori (il cui numero è selezionato dall’utente come parametro esterno) come quelle parole che massimizzano il prodotto delle probabilità condizionate al verificarsi di tutte le altre, coerentemente con la definizione fornita in precedenza. 
Una volta scelti i termini aggregatori, a ciascuno di essi sono associati dei termini aggregati e il coefficiente di relazione tra termini aggregatori ed aggregati è calcolato sulla base della probabilità congiunta. Il numero di legami tra aggregatori e tra aggregatori/aggregati dipende da un parametro esterno (Max Pairs) che va ad influire su opportune soglie che filtrano le coppie debolmente correlate. La terza fase ha come obiettivo la ricerca della struttura WWP ottima, che tenga conto dell’informazione presente in tutti i documenti del corpus e che non sia maggiormente caratterizzata da un sottoinsieme di essi. L’efficacia della struttura WWP è stata dapprima valutata in problemi di Query Expansion nell’ambito dei motori di ricerca interattivi. In questo scenario l’utente, dopo aver ottenuto dal sistema un primo ranking di documenti in risposta ad una sua query iniziale, è chiamato a selezionare alcuni documenti da lui giudicati rilevanti che andranno a costituire il relevance feedback da cui estrarre opportunamente nuovi termini per espandere la query iniziale e raffinare la ricerca. Nel caso specifico, la struttura WWP appresa dal relevance feedback viene opportunamente tradotta in una query mediante un linguaggio di interrogazione proprio del modulo di Information Retrieval utilizzato. La sperimentazione in questo contesto applicativo è stata condotta mediante l’utilizzo del dataset standard TREC-8, costituito da circa 520 mila documenti pre-classificati. E’ stato effettuato un confronto di performance tra la baseline ( risultati ottenuti da query priva di espansione), la struttura WWP ed un metodo di espansione basato sulla Divergenza di Kullback Leibler, indicato in letteratura come il metodo di estrazione delle feature più performante nei problemi di query expansion; le misurazioni effettuate sono tipiche dell’information retrieval: precisione a vari livelli, mean average precision, binary preference, R-precision. La valutazione di tali quantità è stata effettuata utilizzando un apposito tool messo a disposizione per la conferenza TREC. I risultati ottenuti sono molto incoraggianti. Un ulteriore campo applicativo in cui la struttura è stata validata è quello della categorizzazione dei documenti. In questo caso, la struttura WWP abbinata ad un modulo di Information Retrieval è utilizzata per implementare un document-ranking text classifier. Un classificatore di questo tipo realizza una soft decision ovvero non fornisce in ouput l’appartenenza di un documento ad una determinata classe ma redige un ranking di documenti che richiede la scelta di un opportuna soglia (Categorization Status Value threshold) per consentire la classificazione vera e propria. Tale soglia è stata scelta valutando le performance del classificatore in termini di micro-precision, micro-recall e micro-F1 rispetto al dataset utilizzato. Quest’ultimo, noto in letteratura come Reuters-21578, è costituito da circa 21 mila articoli di giornale; il sottoinsieme utilizzato, noto come ModApte split, include i soli documenti classificati manualmente da umani (10 categorie). La sperimentazione è stata condotta selezionando l’1% in maniera random del training set di ciascuna categoria e tale selezione è stata effettuata 100 volte in modo che i risultati non fossero polarizzati dallo specifico sottoinsieme. 
Le performance, valutate mediante calcolo della misura F1 (media armonica di precisione e richiamo), sono state confrontate con le Support Vector Machines, in letteratura indicate come stato dell’arte nella classificazione del dataset impiegato. I risultati ottenuti mostrano che quando il training set è ridotto al 1%, le performance del classificatore basato su WWP sono mediamente superiori a quelle delle SVM. I risultati ottenuti dall'impiego della struttura WWP nei campi di Text Retrieval e Text Mining sono molto interessanti e stanno ottenendo buon riscontro da parte della comunità scientifica. Dal punto di vista delle prospettive future, essendo attualmente la struttura appresa dai soli esempi positivi, potrebbe essere interessante valutare l'incremento di performance ottenuto impiegando 2 strutture WWP apprese rispettivamente da esempi positivi e negativi. Naturalmente, trattandosi di un classificatore soft decision, diventa cruciale stabilire una corretta politica di combinazione tra i ranking ottenuti dall'impiego del WWP "positivo" e quello negativo e la scelta della soglia CSV. Un altro interessante spunto futuro riguarda la costruzione di ontologie complete da strutture WWP, che richiederebbe l'identificazione delle tipologie di relazioni esistenti tra i termini mediante ausilio di conoscenza esogena (WordNet, etc...). [a cura dell'autore]
XI n.s.
APA, Harvard, Vancouver, ISO, and other styles
33

Miller, Daniel. "A System for Natural Language Unmarked Clausal Transformations in Text-to-Text Applications." DigitalCommons@CalPoly, 2009. https://digitalcommons.calpoly.edu/theses/137.

Full text
Abstract:
A system is proposed which separates clauses from complex sentences into simpler stand-alone sentences. This is useful as an initial step on raw text, where the resulting processed text may be fed into text-to-text applications such as Automatic Summarization, Question Answering, and Machine Translation, in which complex sentences are difficult to process. Grammatical natural language transformations provide a possible method of simplifying complex sentences to enhance the results of text-to-text applications. Using shallow parsing, this system improves on existing systems in identifying and separating marked and unmarked embedded clauses in complex sentence structures, resulting in a syntactically simplified source for further processing.
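For marked clauses, separation can be crudely approximated with surface cues alone, as in the toy splitter below; unmarked clauses, the harder case the thesis addresses with shallow parsing, need POS and chunk information, and the real system also rewrites each extracted clause into a grammatical stand-alone sentence, which this sketch does not attempt.

```python
import re

# Markers that often introduce a (marked) embedded clause; purely illustrative.
CLAUSE_MARKERS = r",\s+(?:which|who|where|although|because|while)\s+"

def split_marked_clauses(sentence):
    """Split a sentence at common clause markers and re-terminate each piece."""
    parts = re.split(CLAUSE_MARKERS, sentence)
    return [p.strip().rstrip(".") + "." for p in parts if p.strip()]

print(split_marked_clauses(
    "The system uses shallow parsing, which keeps the analysis fast."))
# ['The system uses shallow parsing.', 'keeps the analysis fast.']
```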
APA, Harvard, Vancouver, ISO, and other styles
34

McMurtry, William F. "Information Retrieval for Call Center Quality Assurance." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587036885211228.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Watanabe, Yasuhiko. "Integrated Analysis of Image, Diagram, and Text for Multimedia Document Retrieval." 京都大学 (Kyoto University), 2002. http://hdl.handle.net/2433/149384.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Nag Chowdhury, Sreyasi [Verfasser]. "Text-image synergy for multimodal retrieval and annotation / Sreyasi Nag Chowdhury." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2021. http://d-nb.info/1240674139/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Leng, Chun-Wu. "Design and performance evaluation of signature files for text retrieval systems /." The Ohio State University, 1990. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487685204967976.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Moreno, José G. "Text-Based Ephemeral Clustering for Web Image Retrieval on Mobile Devices." Caen, 2014. http://www.theses.fr/2014CAEN2036.

Full text
Abstract:
Dans cette thèse, nous présentons une étude sur la visualisation des résultats Web d'images sur les dispositifs nomades. Nos principales conclusions ont été inspirées par les avancées récentes dans deux principaux domaines de recherche – la recherche d'information et le traitement automatique du langage naturel. Tout d’abord, nous avons examiné différents sujets tels que le regroupement des résultats Web, les interfaces mobiles, la fouille des intentions sur une requête, pour n'en nommer que quelques-uns. Ensuite, nous nous sommes concentré sur les mesures d'association lexical, les métriques de similarité d'ordre élevé, etc. Notamment afin de valider notre hypothèse, nous avons réalisé différentes expériences avec des jeux de données spécifiques de la tâche. De nombreuses caractéristiques sont évaluées dans les solutions proposées. Premièrement, la qualité de regroupement en utilisant à la fois des métriques d'évaluation classiques, mais aussi des métriques plus récentes. Deuxièmement, la qualité de l'étiquetage de chaque groupe de documents est évaluée pour s'assurer au maximum que toutes les intentions des requêtes sont couvertes. Finalement, nous évaluons l'effort de l'utilisateur à explorer les images dans une interface basée sur l'utilisation des galeries présentées sur des dispositifs nomades. Un chapitre entier est consacré à chacun de ces trois aspects dans lesquels les jeux de données - certains d'entre eux construits pour évaluer des caractéristiques spécifiques - sont présentés. Comme résultats de cette thèse, nous sommes développés : deux algorithmes adaptés aux caractéristiques du problème, deux jeux de données pour les tâches respectives et un outil d'évaluation pour le regroupement des résultats d'une requête (SRC pour les sigles en anglais). Concernant les algorithmes, Dual C-means est notre principal contribution. Il peut être vu comme une généralisation de notre algorithme développé précédemment, l'AGK-means. Les deux sont basés sur des mesures d'association lexical à partir des résultats Web. Un nouveau jeu de données pour l'évaluation complète d'algorithmes SRC est élaboré et présenté. De même, un nouvel ensemble de données sur les images Web est développé et utilisé avec une nouvelle métrique à fin d'évaluer l'effort fait pour les utilisateurs lors qu'ils explorent un ensemble d'images. Enfin, nous avons développé un outil d'évaluation pour le problème SRC, dans lequel nous avons mis en place plusieurs mesures classiques et récentes utilisées en SRC. Nos conclusions sont tirées compte tenu des nombreux facteurs qui ont été discutés dans cette thèse. Cependant, motivés par nos conclusions, des études supplémentaires pourraient être développés. Celles-ci sont discutées à la fin de ce manuscrit et notre résultats préliminaires suggère que l’association de plusieurs sources d'information améliore déjà la qualité du regroupement
In this thesis, we present a study of Web image result visualization on mobile devices. Our main findings were inspired by recent advances in two main research areas: Information Retrieval and Natural Language Processing. In the former, we considered topics such as search results clustering, mobile Web interfaces and query intent mining, to name but a few. In the latter, we focused on collocation measures, high-order similarity metrics, etc. In order to validate our hypothesis, we performed a number of experiments with task-specific datasets. Several characteristics are evaluated in the proposed solutions. First, the clustering quality, for which both classical and recent evaluation metrics are considered. Second, the labeling quality of each cluster, evaluated to make sure that all possible query intents are covered. Third and finally, the user's effort in exploring the images in a gallery-based interface. An entire chapter is dedicated to each of these three aspects, in which the datasets - some of them built to evaluate specific characteristics - are presented. The outcomes of this work are two algorithms, two datasets and a search results clustering (SRC) evaluation tool. Of the algorithms, Dual C-means is our main contribution. It can be seen as a generalization of our previously developed algorithm, AGK-means. Both are based on text-based similarity metrics. A new dataset for the complete evaluation of SRC algorithms is developed and presented. Similarly, a new Web image dataset is developed and used together with a new metric to measure the user's effort when a set of Web images is explored. Finally, we developed an evaluation tool for the SRC problem, in which we have implemented several classical and recent SRC metrics. Our conclusions are drawn considering the numerous factors discussed in this thesis. Additional studies could, however, be motivated by our findings. Some of them are discussed at the end of this study, and preliminary analyses suggest that they are promising directions.
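As a generic illustration of search results clustering (and not of the Dual C-means or AGK-means algorithms developed in the thesis), the sketch below clusters toy web-result snippets with TF-IDF and k-means and labels each cluster with its top centroid terms. The snippets and cluster count are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy snippets for the ambiguous query "jaguar" (illustrative data only).
snippets = [
    "jaguar the big cat lives in south american rainforests",
    "jaguar cars unveils a new luxury sedan model",
    "habitat loss threatens the wild jaguar population",
    "used jaguar vehicles for sale at local dealerships",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(snippets)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Label each cluster with its highest-weighted centroid terms.
terms = np.array(vectorizer.get_feature_names_out())
for c in range(kmeans.n_clusters):
    top = terms[np.argsort(kmeans.cluster_centers_[c])[::-1][:3]]
    members = [i for i, lab in enumerate(kmeans.labels_) if lab == c]
    print(f"cluster {c}: label={list(top)} snippets={members}")
```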
APA, Harvard, Vancouver, ISO, and other styles
39

Larsson, Jimmy. "Taxonomy Based Image Retrieval : Taxonomy Based Image Retrieval using Data from Multiple Sources." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-180574.

Full text
Abstract:
With a multitude of images available on the Internet, how do we find what we are looking for? This project tries to determine how much the precision and recall of search queries are improved by using a word taxonomy with traditional Text-Based Image Search and Content-Based Image Search. By applying a word taxonomy to different data sources, a strong keyword filter and a keyword extender were implemented and tested. The results show that, depending on the implementation, either the precision or the recall can be increased. By using a similar approach in real-life implementations, it is possible to push images with higher precision to the front while keeping a high recall value, thus increasing the experienced relevance of image search.
Med den mängd bilder som nu finns tillgänglig på Internet, hur kan vi fortfarande hitta det vi letar efter? Denna uppsats försöker avgöra hur mycket bildprecision och bildåterkallning kan öka med hjälp av appliceringen av en ordtaxonomi på traditionell Text-Based Image Search och Content-Based Image Search. Genom att applicera en ordtaxonomi på olika datakällor kan ett starkt ordfilter samt en modul som förlänger ordlistor skapas och testas. Resultaten pekar på att beroende på implementationen så kan antingen precisionen eller återkallningen förbättras. Genom att använda en liknande metod i ett verkligt scenario är det därför möjligt att flytta bilder med hög precision längre fram i resultatlistan och samtidigt behålla hög återkallning, och därmed öka den upplevda relevansen i bildsök.
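A minimal sketch of a taxonomy-based keyword extender in the spirit of the abstract above, using WordNet (via NLTK) as a stand-in taxonomy; the thesis may use different data sources, and the helper name is mine.

```python
# Requires: pip install nltk ; then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def extend_keywords(keyword: str, max_terms: int = 10):
    """Extend a query keyword with synonyms and hypernyms from WordNet,
    which here stands in for the word taxonomy described in the abstract."""
    extended = set()
    for synset in wn.synsets(keyword):
        extended.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
        for hypernym in synset.hypernyms():
            extended.update(lemma.name().replace("_", " ") for lemma in hypernym.lemmas())
    extended.discard(keyword)
    return sorted(extended)[:max_terms]

# A broader keyword list tends to raise recall; a filter working in the opposite
# direction (keeping only taxonomy-confirmed terms) would instead favour precision.
print(extend_keywords("dog"))
```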
APA, Harvard, Vancouver, ISO, and other styles
40

Williams, Ken. "A framework for text categorization." Thesis, The University of Sydney, 2003. https://hdl.handle.net/2123/27951.

Full text
Abstract:
The field of automatic Text Categorization (TC) concerns the creation of categorizer functions, usually involving Machine Learning techniques, to assign labels from a pre-defined set of categories to documents based on the documents' content. Because of the many variations on how this can be achieved and the diversity of applications in which it can be employed, creating specific TC applications is often a difficult task. This thesis concerns the design, implementation, and testing of an Object-Oriented Application Framework for Text Categorization. By encoding expertise in the architecture of the framework, many of the barriers to creating TC applications are eliminated. Developers can focus on the domain-specific aspects of their applications, leaving the generic aspects of categorization to the framework. This allows significant code and design reuse when building new applications. Chapter 1 provides an introduction to automatic Text Categorization, Object-Oriented Application Frameworks, and Design Patterns. Some common application areas and benefits of using automatic TC are discussed. Frameworks are defined and their advantages compared to other software engineering strategies are presented. Design patterns are defined and placed in the context of framework development. An overview of three related products in the TC space, Weka, Autonomy, and Teragram, follows. Chapter 2 contains a detailed presentation of Text Categorization. TC is formally defined, followed by a detailed account of the main functional areas in Text Categorization that a modern TC framework must provide. These include document tokenizing, feature selection and reduction, Machine Learning techniques, and categorization runtime behavior. Four Machine Learning techniques (Naïve Bayes categorizers, k-Nearest-Neighbor categorizers, Support Vector Machines, and Decision Trees) are presented, with discussions of their core algorithms and the computational complexity involved. Several measures for evaluating the quality of a categorizer are then defined, including precision, recall, and the Fβ measure. The design of a framework that addresses the functional areas from Chapter 2 is presented in Chapter 3. This design is motivated by consideration of the framework's audience and some expected usage scenarios. The core architectural classes in the framework are then presented, and Design Patterns are employed in a detailed discussion of the cooperative relationships among framework classes. This is the first known use of Design Patterns in an academic work on Text Categorization software. Following the presentation of the framework design, some possible design limitations are discussed. The design in Chapter 3 has been implemented as the AI::Categorizer Perl package. Chapter 4 is a short discussion of implementation issues, including considerations in choosing the programming language. Special consideration is given to the implementation of constructor methods in the framework, since they are responsible for enforcing the structural relationships among framework classes. Three data structure issues within the framework are then discussed: feature vectors, sets of document or category objects, and the serialized representation of a framework object. Chapter 5 evaluates the framework from several different perspectives on two corpora. The first corpus is the standard Reuters-21578 benchmark corpus, and the second is assembled from messages sent to an educational ask-an-expert service.
Using these corpora, the framework is evaluated on the measures introduced in Chapter 2. The performance on the first corpus is compared to the well-known results in [50]. The Naïve Bayes categorizer is found to be competitive with standard implementations in the literature, and the Support Vector Machine and k-Nearest-Neighbor implementations are outperformed by comparable systems by other researchers. The framework is then evaluated in terms of its resource usage, and several applications using AI::Categorizer are presented in order to show the framework's ability to function in the usage scenarios discussed in Chapter 3.
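Since the abstract leans on precision, recall and the Fβ measure, here is a small worked sketch of those standard formulas (generic definitions in Python, not code from the AI::Categorizer Perl package).

```python
def precision_recall_fbeta(tp: int, fp: int, fn: int, beta: float = 1.0):
    """Standard categorizer effectiveness measures:
    precision = tp / (tp + fp), recall = tp / (tp + fn),
    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    beta2 = beta * beta
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall)
    return precision, recall, f_beta

# With 40 true positives, 10 false positives and 20 false negatives:
# precision = 0.8, recall ≈ 0.667, F1 ≈ 0.727.
print(precision_recall_fbeta(tp=40, fp=10, fn=20, beta=1.0))
```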
APA, Harvard, Vancouver, ISO, and other styles
41

Katsarona, Stavros. "SCANTRAX - an associative string processor for relational database management and text retrieval." Thesis, Brunel University, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.292397.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Martins, Bruno. "Geographically Aware Web Text Mining." Master's thesis, Department of Informatics, University of Lisbon, 2009. http://hdl.handle.net/10451/14301.

Full text
Abstract:
Text mining and search have become important research areas over the past few years, mostly due to the large popularity of the Web. A natural extension for these technologies is the development of methods for exploring the geographic context of Web information. Human information needs often present specific geographic constraints. Many Web documents also refer to specific locations. However, relatively little effort has been spent on developing the facilities required for geographic access to unstructured textual information. Geographically aware text mining and search remain relatively unexplored. This thesis addresses this new area, arguing that Web text mining can be applied to extract geographic context information, and that this information can be explored for information retrieval. Fundamental questions investigated include handling geographic references in text, assigning geographic scopes to the documents, and building retrieval applications that handle/use geographic scopes. The thesis presents appropriate solutions for each of these challenges, together with a comprehensive evaluation of their effectiveness. By investigating these questions, the thesis presents several findings on how the geographic context can be effectively handled by text processing tools.
APA, Harvard, Vancouver, ISO, and other styles
43

Dunning, Ted Emerson. "Finding structure in text, genome and other symbolic sequences." Thesis, University of Sheffield, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.310811.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Tursun, Osman. "Missing ingredients in optimising large-scale image retrieval with deep features." Thesis, Queensland University of Technology, 2022. https://eprints.qut.edu.au/227803/1/Osman_Tursun_Thesis.pdf.

Full text
Abstract:
This thesis applies advanced image processing and deep machine learning techniques to solve the challenges of large-scale image retrieval. Solutions are provided to overcome key obstacles in real-world large-scale image retrieval applications by introducing unique methods for making deep learning systems more reliable and efficient. The outcome of the research is useful for several image retrieval applications including patent search, and trademark and logo infringement analysis.
APA, Harvard, Vancouver, ISO, and other styles
45

Leidner, Jochen Lothar. "Toponym resolution in text." Thesis, University of Edinburgh, 2007. http://hdl.handle.net/1842/1849.

Full text
Abstract:
Background. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g., using latitude and longitude). However, present-day GIS systems provide no automatic geo-coding functionality for unstructured text. In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)). However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g. for choosing a focus) and question answering (e.g. for questions like How far is London from Edinburgh?, given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been neglected. Concentrating on geographic names for populated places, I define the task of automatic Toponym Resolution (TR) as computing the mapping from occurrences of names for places as found in a text to a representation of the extensional semantics of the location referred to (its referent), such as a geographic latitude/longitude footprint. The task of mapping from names to locations is hard due to insufficient and noisy databases, and a large degree of ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK or to London, Ontario, Canada, or to about forty other Londons on earth). In addition, names of places and the boundaries referred to change over time, and databases are incomplete. Objective. I investigate how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model robustly on open-domain news text. I begin by comparing the few algorithms proposed in the literature, and, comparing semiformal, reconstructed descriptions of them, I factor out a shared repertoire of linguistic heuristics (e.g. rules, patterns) and extra-linguistic knowledge sources (e.g. population sizes). I then investigate how to combine these sources of evidence to obtain a superior method. I also investigate the noise effect introduced by the named entity tagging step that toponym resolution relies on in a sequential system pipeline architecture. Scope. In this thesis, I investigate a present-day snapshot of terrestrial geography as represented in the gazetteer defined and, accordingly, a collection of present-day news text. I limit the investigation to populated places; geo-coding of artifact names (e.g. airports or bridges), compositional geographic descriptions (e.g. 40 miles SW of London, near Berlin), for instance, is not attempted. 
Historic change is a major factor affecting gazetteer construction and ultimately toponym resolution. However, this is beyond the scope of this thesis. Method. While a small number of previous attempts have been made to solve the toponym resolution problem, these were either not evaluated, or evaluation was done by manual inspection of system output instead of curating a reusable reference corpus. Since the relevant literature is scattered across several disciplines (GIS, digital libraries, information retrieval, natural language processing) and descriptions of algorithms are mostly given in informal prose, I attempt to systematically describe them and aim at a reconstruction in a uniform, semi-formal pseudo-code notation for easier re-implementation. A systematic comparison leads to an inventory of heuristics and other sources of evidence. In order to carry out a comparative evaluation procedure, an evaluation resource is required. Unfortunately, to date no gold standard has been curated in the research community. To this end, a reference gazetteer and an associated novel reference corpus with human-labeled referent annotation are created. These are subsequently used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory. I then compare the performance of the same TR algorithms under three different conditions, namely applying them to (i) the output of human named entity annotation, (ii) automatic annotation using an existing Maximum Entropy sequence tagging model, and (iii) a naïve toponym lookup procedure in a gazetteer. Evaluation. The algorithms implemented in this thesis are evaluated in an intrinsic or component evaluation. To this end, we define a task-specific matching criterion to be used with traditional Precision (P) and Recall (R) evaluation metrics. This matching criterion is lenient with respect to numerical gazetteer imprecision in situations where one toponym instance is marked up with different gazetteer entries in the gold standard and the test set, respectively, but where these refer to the same candidate referent, caused by multiple near-duplicate entries in the reference gazetteer. Main Contributions. The major contributions of this thesis are as follows:
• A new reference corpus in which instances of location named entities have been manually annotated with spatial grounding information for populated places, and an associated reference gazetteer, from which the assigned candidate referents are chosen. This reference gazetteer provides numerical latitude/longitude coordinates (such as 51°32′0″ North, 0°5′0″ West) as well as hierarchical path descriptions (such as London > UK) with respect to a world-wide-coverage geographic taxonomy constructed by combining several large but noisy gazetteers. This corpus contains news stories and comprises two sub-corpora, a subset of the REUTERS RCV1 news corpus used for the CoNLL shared task (Tjong Kim Sang and De Meulder (2003)), and a subset of the Fourth Message Understanding Contest (MUC-4; Chinchor (1995)), both available pre-annotated with gold-standard annotations. This corpus will be made available as a reference evaluation resource;
• a new method and implemented system to resolve toponyms that is capable of robustly processing unseen text (open-domain online newswire text) and grounding toponym instances in an extensional model using longitude and latitude coordinates and hierarchical path descriptions, using internal (textual) and external (gazetteer) evidence;
• an empirical analysis of the relative utility of various heuristic biases and other sources of evidence with respect to the toponym resolution task when analysing free news genre text;
• a comparison between a replicated method as described in the literature, which functions as a baseline, and a novel algorithm based on minimality heuristics; and
• several exemplary prototypical applications to show how the resulting toponym resolution methods can be used to create visual surrogates for news stories, a geographic exploration tool for news browsing, geographically-aware document retrieval and to answer spatial questions (How far...?) in an open-domain question answering system. These applications only have demonstrative character, as a thorough quantitative, task-based (extrinsic) evaluation of the utility of automatic toponym resolution is beyond the scope of this thesis and left for future work.
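To make one of the heuristic biases inventoried above concrete, the sketch below grounds an ambiguous place name with a simple "largest population wins" default, one of several sources of evidence such systems can combine. The toy gazetteer, coordinates and function name are illustrative assumptions, not the thesis system.

```python
# Toy gazetteer: each surface form maps to candidate referents with
# latitude, longitude, population and a hierarchical path (illustrative values).
GAZETTEER = {
    "London": [
        {"lat": 51.51, "lon": -0.13, "population": 8_900_000, "path": "London > UK"},
        {"lat": 42.98, "lon": -81.25, "population": 404_000, "path": "London > Ontario > Canada"},
    ],
    "Edinburgh": [
        {"lat": 55.95, "lon": -3.19, "population": 510_000, "path": "Edinburgh > Scotland > UK"},
    ],
}

def resolve_toponym(name: str, gazetteer=GAZETTEER):
    """Ground a place name to one candidate referent using a simple
    'largest population wins' heuristic; real resolvers combine this
    with textual context, minimality and other evidence."""
    candidates = gazetteer.get(name, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["population"])

for toponym in ("London", "Edinburgh"):
    print(toponym, "->", resolve_toponym(toponym)["path"])
```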
APA, Harvard, Vancouver, ISO, and other styles
46

Albathan, Mubarak Murdi M. "Enhancement of relevant features for text mining." Thesis, Queensland University of Technology, 2015. https://eprints.qut.edu.au/90072/1/Mubarak%20Murdi%20M_Albathan_Thesis.pdf.

Full text
Abstract:
With the explosion of information resources, there is an urgent need to understand interesting text features or topics in massive volumes of text. This thesis proposes a theoretical model to accurately weight specific text features, such as patterns and n-grams. The proposed model achieves impressive performance on two data collections, Reuters Corpus Volume 1 (RCV1) and Reuters 21578.
APA, Harvard, Vancouver, ISO, and other styles
47

Alsaad, Amal. "Enhanced root extraction and document classification algorithm for Arabic text." Thesis, Brunel University, 2016. http://bura.brunel.ac.uk/handle/2438/13510.

Full text
Abstract:
Many text extraction and classification systems have been developed for English and other international languages, most of which are based on Roman letters. However, Arabic is a difficult language with special rules and morphology, and not many systems have been developed for Arabic text categorization. Arabic is one of the Semitic languages, with a morphology that is more complicated than that of English. Due to its complex morphology, pre-processing routines are needed to extract the roots of words and then classify them according to their group of acts or meaning. In this thesis, a system has been developed and tested for text classification. The system is based on two stages: the first extracts the roots from text and the second classifies the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase handles the removal of affixes, including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, while its morphological pattern is checked after each deduction to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as there is a substantial amount of irregular Arabic words in texts. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quadriliteral roots. A series of experiments has been conducted to improve and test the performance of the proposed algorithm. The results revealed that the developed algorithm has better accuracy than the existing stemming algorithm. The second stage is the document classification stage. In this stage two non-parametric classifiers are tested, namely Artificial Neural Networks (ANN) and Support Vector Machines (SVM). The system is trained on 6 categories: culture, economy, international, local, religion and sports, using 80% of the available data. From each category, the 10 most frequent terms are selected as features. Testing of the classification algorithms has been done on the remaining 20% of the documents. The results of ANN and SVM are compared to the standard method used for text classification, the term-frequency-based method. Results show that ANN and SVM have better accuracy (80-90%) compared to the standard method (60-70%). The proposed method proves the ability to categorize Arabic text documents into the appropriate categories with a high precision rate.
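As a toy illustration of the affix-stripping idea in the first stage (a drastic simplification of the algorithm the thesis develops), the snippet below removes a few common prefixes and suffixes and then checks the residue against a tiny root list. The affix and root lists are placeholders, not the 3800 triliteral and 900 quadriliteral roots used in the thesis, and no morphological-pattern checking is attempted.

```python
# Illustrative affix and root lists only; the real system also checks
# morphological patterns and handles weak and geminated words.
PREFIXES = ["وال", "بال", "ال", "و", "ف", "ب", "ل"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]
ROOTS = {"كتب", "درس", "علم"}

def extract_root(word: str):
    """Strip one known prefix and one known suffix while at least three
    letters remain, then report whether the residue is a listed root."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word, word in ROOTS

for token in ["والعلم", "كتبها", "درس"]:
    root, known = extract_root(token)
    print(token, "->", root, "(known root)" if known else "(not in root list)")
```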
APA, Harvard, Vancouver, ISO, and other styles
48

Lu, Zhiyong. "Text mining on GeneRIFs /." Connect to full text via ProQuest. Limited to UCD Anschutz Medical Campus, 2007.

Find full text
Abstract:
Thesis (Ph.D. in ) -- University of Colorado Denver, 2007.
Typescript. Includes bibliographical references (leaves 174-182). Free to UCD affiliates. Online version available via ProQuest Digital Dissertations;
APA, Harvard, Vancouver, ISO, and other styles
49

Goyal, Pawan. "Analytic knowledge discovery techniques for ad-hoc information retrieval and automatic text summarization." Thesis, Ulster University, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.543897.

Full text
Abstract:
Information retrieval is broadly concerned with the problem of automated searching for information within some document repository to support various information requests by users. The traditional retrieval frameworks work on the simplistic assumptions of “word independence” and “bag-of-words”, giving rise to problems such as “term mismatch” and “context independent document indexing”. Automatic text summarization systems, which use the same paradigm as that of information retrieval, also suffer from these problems. The concept of “semantic relevance” has also not been formulated in the existing literature. This thesis presents a detailed investigation of the knowledge discovery models and proposes new approaches to address these issues. The traditional retrieval frameworks do not succeed in defining the document content fully because they do not process the concepts in the documents; only the words are processed. To address this issue, a document retrieval model has been proposed using concept hierarchies, learnt automatically from a corpus. A novel approach to give a meaningful representation to the concept nodes in a learnt hierarchy has been proposed using a fuzzy logic based soft least upper bound method. A novel approach of adapting the vector space model with dependency parse relations for information retrieval has also been developed. A user query for information retrieval (IR) applications may not contain the most appropriate terms (words) as actually intended by the user. This is usually referred to as the term mismatch problem and is a crucial research issue in IR. To address this issue, a theoretical framework for Query Representation (QR) has been developed through a comprehensive theoretical analysis of a parametric query vector. A lexical association function has been derived analytically using the relevance criteria. The proposed QR model expands the user query using this association function. A novel term association metric has been derived using the Bernoulli model of randomness. The derived metric has been used to develop a Bernoulli Query Expansion (BQE) model. The Bernoulli model of randomness has also been extended to the pseudo relevance feedback problem by proposing a Bernoulli Pseudo Relevance (BPR) model. In the traditional retrieval frameworks, the context in which a term occurs is mostly overlooked in assigning its indexing weight. This results in context independent document indexing. To address this issue, a novel Neighborhood Based Document Smoothing (NBDS) model has been proposed, which uses the lexical association between terms to provide a context sensitive indexing weight to the document terms, i.e. the term weights are redistributed based on the lexical association with the context words. To address the “context independent document indexing” for the sentence extraction based text summarization task, a lexical association measure derived using the Bernoulli model of randomness has been used. A new approach using the lexical association between terms has been proposed to give a context sensitive weight to the document terms and these weights have been used for the sentence extraction task. Developed analytically, the proposed QR, BQE, BPR and NBDS models provide a proper mathematical framework for query expansion and document smoothing techniques, which have largely been heuristic in the existing literature.
Being developed in the generalized retrieval framework, as also proposed in this thesis, these models are applicable to all of the retrieval frameworks. These models have been empirically evaluated over the benchmark TREC datasets and have been shown to provide significantly better performance than the baseline retrieval frameworks, without adding significant computational or storage burden. The Bernoulli model applied to the sentence extraction task has also been shown to enhance the performance of the baseline text summarization systems over the benchmark DUC datasets. The theoretical foundations, along with the empirical results, verify that the proposed knowledge discovery models in this thesis advance the state of the art in the field of information retrieval and automatic text summarization.
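To make the "context sensitive indexing weight" idea tangible, here is a generic sketch of redistributing term weights by lexical association with co-occurring terms. The interpolation form, parameter value and association scores are illustrative assumptions in the spirit of the NBDS idea, not the exact formulas derived in the thesis.

```python
def smooth_term_weights(doc_weights, association, lam=0.3):
    """Redistribute document term weights using lexical association with the
    other terms in the document:
        w'(t, d) = (1 - lam) * w(t, d) + lam * sum_t' assoc(t, t') * w(t', d)
    doc_weights: {term: weight}; association: {(t, t'): score in [0, 1]}."""
    smoothed = {}
    for term, weight in doc_weights.items():
        neighbour_mass = sum(
            association.get((term, other), 0.0) * w
            for other, w in doc_weights.items() if other != term
        )
        smoothed[term] = (1 - lam) * weight + lam * neighbour_mass
    return smoothed

# "engine" gets boosted because it is strongly associated with "search",
# which carries a high weight in this toy document.
weights = {"search": 0.9, "engine": 0.2, "banana": 0.1}
assoc = {("engine", "search"): 0.8, ("search", "engine"): 0.8}
print(smooth_term_weights(weights, assoc))
```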
APA, Harvard, Vancouver, ISO, and other styles
50

Smith, Owen John Robert. "An experiment on integration of Hypertext within a multi-user text retrieval system." Thesis, Queen's University Belfast, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.317527.

Full text
APA, Harvard, Vancouver, ISO, and other styles
