Dissertations / Theses on the topic 'Text indexing'

To see the other types of publications on this topic, follow the link: Text indexing.

Create an accurate reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Text indexing.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

He, Meng. "Indexing Compressed Text." Thesis, University of Waterloo, 2003. http://hdl.handle.net/10012/1143.

Abstract:
As a result of the rapid growth of the volume of electronic data, text compression and indexing techniques are receiving more and more attention. These two issues are usually treated as independent problems, but approaches that combine them have recently attracted the attention of researchers. In this thesis, we review and test some of the more effective and some of the more theoretically interesting techniques. Various compression and indexing techniques are presented, and we also present two compressed text indices. Based on these techniques, we implement a compressed full-text index, so that compressed texts can be indexed to support fast queries without decompressing the whole text. The experiments show that our index is compact and supports fast search.
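
The kind of index described here can be illustrated with the backward-search step used by FM-index-style compressed full-text indexes. The sketch below is a didactic Python toy (naive suffix sorting, linear scans instead of compressed rank structures), not the specific index implemented in the thesis:

```python
# Minimal sketch of backward search over a Burrows-Wheeler transform,
# the core query mechanism of FM-index-style compressed full-text indexes.

def bwt(text):
    """Burrows-Wheeler transform via a naive suffix sort ('$' terminates)."""
    text += "$"
    rotations = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    return "".join(text[i - 1] for i in rotations)

def count_occurrences(text, pattern):
    """Count pattern occurrences using backward search on the BWT."""
    L = bwt(text)
    # C[c] = number of characters in L strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    lo, hi = 0, len(L)           # current suffix-array interval [lo, hi)
    for c in reversed(pattern):  # extend the match one character at a time
        if c not in C:
            return 0
        lo = C[c] + L[:lo].count(c)
        hi = C[c] + L[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

print(count_occurrences("mississippi", "issi"))  # -> 2
```

A real implementation replaces the linear `count` scans with a compressed rank/select structure, which is what keeps both space and query time small.
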
2

Sani, Sadiq. "Role of semantic indexing for text classification." Thesis, Robert Gordon University, 2014. http://hdl.handle.net/10059/1133.

Abstract:
The Vector Space Model (VSM) of text representation suffers a number of limitations for text classification. Firstly, the VSM is based on the Bag-Of-Words (BOW) assumption, where terms from the indexing vocabulary are treated independently of one another. However, the expressiveness of natural language means that lexically different terms often have related or even identical meanings. Thus, failure to take into account the semantic relatedness between terms means that document similarity is not properly captured in the VSM. To address this problem, semantic indexing approaches have been proposed for modelling the semantic relatedness between terms in document representations. Accordingly, in this thesis, we empirically review the impact of semantic indexing on text classification. This empirical review allows us to answer one important question: how beneficial is semantic indexing to text classification performance? We also carry out a detailed analysis of the semantic indexing process, which allows us to identify reasons why semantic indexing may lead to poor text classification performance. Based on our findings, we propose a semantic indexing framework called Relevance Weighted Semantic Indexing (RWSI) that addresses the limitations identified in our analysis. RWSI uses relevance weights of terms to improve the semantic indexing of documents. A second problem with the VSM is the lack of supervision in the process of creating document representations. This arises from the fact that the VSM was originally designed for unsupervised document retrieval. An important feature of effective document representations is the ability to discriminate between relevant and non-relevant documents. For text classification, relevance information is explicitly available in the form of document class labels. Thus, more effective document vectors can be derived in a supervised manner by taking advantage of available class knowledge. Accordingly, we investigate approaches for utilising class knowledge for supervised indexing of documents. Firstly, we demonstrate how the RWSI framework can be utilised for assigning supervised weights to terms for supervised document indexing. Secondly, we present an approach called Supervised Sub-Spacing (S3) for supervised semantic indexing of documents. A further limitation of the standard VSM is that an indexing vocabulary consisting only of terms from the document collection is used for document representation. This is based on the assumption that terms alone are sufficient to model the meaning of text documents. However, for certain classification tasks, terms are insufficient to adequately model the semantics needed for accurate document classification. A solution is to index documents using semantically rich concepts. Accordingly, we present an event extraction framework called Rule-Based Event Extractor (RUBEE) for identifying and utilising event information for concept-based indexing of incident reports. We also demonstrate how certain attributes of these events, e.g. negation, can be taken into consideration to distinguish between documents that describe the occurrence of an event and those that mention the non-occurrence of that event.
3

Bowden, Paul Richard. "Automated knowledge extraction from text." Thesis, Nottingham Trent University, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.298900.

4

Mick, Alan A. "Knowledge based text indexing and retrieval utilizing case based reasoning." Online version of thesis, 1994. http://hdl.handle.net/1850/11715.

5

Lester, Nicholas. "Efficient Index Maintenance for Text Databases." RMIT University, Computer Science and Information Technology, 2006. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20070214.154933.

Abstract:
All practical text search systems use inverted indexes to quickly resolve user queries. Offline index construction algorithms, where queries are not accepted during construction, have been the subject of much prior research. As a result, current techniques can invert virtually unlimited amounts of text in limited main memory, making efficient use of both time and disk space. However, these algorithms assume that the collection does not change during the use of the index. This thesis examines the task of index maintenance, the problem of adapting an inverted index to reflect changes in the collection it describes. Existing approaches to index maintenance are discussed, including proposed optimisations. We present analysis and empirical evidence suggesting that existing maintenance algorithms either scale poorly to large collections, or significantly degrade query resolution speed. In addition, we propose a new strategy for index maintenance that trades a strictly controlled amount of querying efficiency for greatly increased maintenance speed and scalability. Analysis and empirical results show that this new algorithm is a useful trade-off between indexing and querying efficiency. In the scenarios described in Chapter 7, the use of the new maintenance algorithm reduces the time required to construct an index to under one sixth of the time taken by algorithms that maintain contiguous inverted lists. In addition to work on index maintenance, we present a new technique for accumulator pruning during ranked query evaluation, and provide evidence that existing approaches are unsatisfactory for large collections. Accumulator pruning is a key problem in both querying efficiency and overall text search system efficiency. Existing approaches either fail to bound the memory footprint required for query evaluation, or suffer loss of retrieval accuracy. In contrast, the new pruning algorithm can be used to limit the memory footprint of ranked query evaluation, and in our experiments gives retrieval accuracy no worse than previous alternatives. The results presented in this thesis are validated with robust experiments, which use collections of significant size containing real data, tested with appropriate numbers of real queries. The techniques presented in this thesis allow information retrieval applications to efficiently index and search changing collections, a task that has historically been problematic.
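
To make the buffered-update idea concrete, here is a toy sketch of one maintenance scheme from the re-merge family the abstract alludes to: new documents are held in a small in-memory index and periodically merged into the main term-to-postings map. The buffer limit and the merge policy are illustrative assumptions, not the thesis's algorithm:

```python
from collections import defaultdict

class MaintainedIndex:
    def __init__(self, buffer_limit=2):
        self.main = defaultdict(list)    # term -> sorted postings (doc ids)
        self.buffer = defaultdict(list)  # in-memory postings for new docs
        self.buffered_docs = 0
        self.buffer_limit = buffer_limit

    def add_document(self, doc_id, text):
        for term in set(text.split()):
            self.buffer[term].append(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.buffer_limit:
            self._merge()

    def _merge(self):
        # One pass appending buffered postings; contiguous inverted
        # lists are preserved at the cost of rewriting them.
        for term, postings in self.buffer.items():
            self.main[term] = sorted(self.main[term] + postings)
        self.buffer.clear()
        self.buffered_docs = 0

    def query(self, term):
        # Queries must consult both the main index and the buffer.
        return sorted(self.main[term] + self.buffer[term])

idx = MaintainedIndex()
idx.add_document(1, "fast text search")
idx.add_document(2, "text index maintenance")
print(idx.query("text"))  # -> [1, 2]
```
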
6

Chung, EunKyung. "A Framework of Automatic Subject Term Assignment: An Indexing Conception-Based Approach." Thesis, University of North Texas, 2006. https://digital.library.unt.edu/ark:/67531/metadc5473/.

Abstract:
The purpose of this dissertation is to examine whether the understandings of subject indexing processes conducted by human indexers have a positive impact on the effectiveness of automatic subject term assignment through text categorization (TC). More specifically, human indexers' subject indexing approaches or conceptions, in conjunction with semantic sources, were explored in the context of a typical scientific journal article data set. Based on the premise that subject indexing approaches or conceptions with semantic sources are important for automatic subject term assignment through TC, this study proposed an indexing conception-based framework. For the purpose of this study, three hypotheses were tested: 1) the effectiveness of semantic sources, 2) the effectiveness of an indexing conception-based framework, and 3) the effectiveness of each of three indexing conception-based approaches (the content-oriented, the document-oriented, and the domain-oriented approaches). The experiments were conducted using a support vector machine implementation in WEKA (Witten & Frank, 2000). The experiment results pointed out that cited works, source title, and title were as effective as the full text, while keywords were found more effective than the full text. In addition, the findings showed that an indexing conception-based framework was more effective than the full text. In particular, the content-oriented and the document-oriented indexing approaches were found more effective than the full text. Among the three indexing conception-based approaches, the content-oriented approach and the document-oriented approach were more effective than the domain-oriented approach. In other words, in the context of a typical scientific journal article data set, the objective contents and authors' intentions were given more focus than the possible users' needs. The research findings of this study support the claim that incorporation of human indexers' indexing approaches or conceptions, in conjunction with semantic sources, has a positive impact on the effectiveness of automatic subject term assignment.
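
The study trained support vector machines in WEKA; purely as a rough modern analogue (scikit-learn stands in for WEKA here, and the training data is hypothetical), categorizing documents indexed by a single semantic source such as titles looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical training data: one string per article (e.g. the title field)
# and a subject label per article.
titles = ["compressed suffix arrays", "wavelet tree rank queries",
          "protein folding pathways", "rna secondary structure"]
labels = ["indexing", "indexing", "biology", "biology"]

# TF-IDF indexing of the chosen semantic source, then a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(titles, labels)
print(model.predict(["compressed text indexes"]))  # expected: ['indexing']
```
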
7

Haouam, Kamel Eddine. "RVSM: A rhetorical conceptual model for content-based indexing and retrieval of text document." Thesis, London Metropolitan University, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.517132.

8

Zhu, Weizhong. "Text clustering and active learning using a LSI subspace signature model and query expansion." Philadelphia, Pa.: Drexel University, 2009. http://hdl.handle.net/1860/3077.

9

Thachuk, Christopher Joseph. "Space and energy efficient molecular programming and space efficient text indexing methods for sequence alignment." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/44172.

Abstract:
Nucleic acids play vital roles in the cell by virtue of the information encoded into their nucleotide sequence and the folded structures they form. Given their propensity to alter their shape over time under changing environmental conditions, an RNA molecule will fold through a series of structures called a folding pathway. As this is a thermodynamically-driven probabilistic process, folding pathways tend to avoid high energy structures, and those which do are said to have a low energy barrier. In the first part of this thesis, we study the problem of predicting low energy barrier folding pathways of a nucleic acid strand. We show that various restrictions of the problem are computationally intractable, unless P=NP. We propose an exact algorithm that has exponential worst-case runtime, but uses only polynomial space and performs well in practice. Motivated by recent applications in molecular programming, we also consider a number of related problems that leverage folding pathways to perform computation. We show that verifying the correctness of these systems is PSPACE-hard, and in doing so show that predicting low energy barrier folding pathways of multiple interacting strands is PSPACE-complete. We explore the computational limits of this class of molecular programs, which are capable, in principle, of logically reversible and thus energy efficient computation. We demonstrate that a space and energy efficient molecular program of this class can be constructed to solve any problem in SPACE, the class of all space-bounded problems. We prove a number of limits to deterministic and also to space efficient computation of molecular programs that leverage folding pathways, and show limits for more general classes. In the second part of this thesis, we continue the study of algorithms and data structures for predicting properties of nucleic acids, but with quite different motivations pertaining to sequence rather than structure. We design a number of compressed text indexes that improve pattern matching queries in light of common biological events such as single nucleotide polymorphisms in genomes and alternative splicing in transcriptomes. Our text indexes and associated algorithms have the potential for use in alignment of sequencing data to reference sequences.
10

Hon, Wing-kai. "On the construction and application of compressed text indexes." Click to view the E-thesis via HKUTO, 2004. http://sunzi.lib.hku.hk/hkuto/record/B31059739.

11

Hon, Wing-kai, and 韓永楷. "On the construction and application of compressed text indexes." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31059739.

12

Geiss, Johanna. "Latent semantic sentence clustering for multi-document summarization." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609761.

13

Ahlgren, Per. "The effects of indexing strategy-query term combination on retrieval effectiveness in a Swedish full text database." Doctoral thesis, University College of Borås, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-171411.

Abstract:
This thesis deals with Swedish full text retrieval and the problem of morphological variation of query terms in the document database. The study is an information retrieval experiment with a test collection. Since no Swedish test collection was available, such a collection was constructed. It consists of a document database containing 161,336 news articles, and 52 topics with four-graded (0, 1, 2, 3) relevance assessments. The effects of indexing strategy-query term combination on retrieval effectiveness were studied. Three of five tested methods involved indexing strategies that used conflation, in the form of normalization. Further, two of these three combinations used indexing strategies that employed compound splitting. Normalization and compound splitting were performed by SWETWOL, a morphological analyzer for the Swedish language. A fourth combination attempted to group related terms by right hand truncation of query terms. A search expert performed the truncation. The four combinations were compared to each other and to a baseline combination, where no attempt was made to counteract the problem of morphological variation of query terms in the document database. Two situations were examined in the evaluation: the binary relevance situation and the multiple degree relevance situation. With regard to the binary relevance situation, where the three (positive) relevance degrees (1, 2, 3) were merged into one, and where precision was used as evaluation measure, the four alternative combinations outperformed the baseline. The best performing combination was the combination that used truncation. This combination performed better than or equal to a median precision value for 41 of the 52 topics. One reason for the relatively good performance of the truncation combination was the capacity of its queries to retrieve different parts of speech. In the multiple degree relevance situation, where the three (positive) relevance degrees were retained, retrieval effectiveness was taken to be the accumulated gain the user receives by examining the retrieval result up to given positions. The evaluation measure used was nDCG (normalized cumulated gain with discount). This measure credits retrieval methods that (1) rank highly relevant documents higher than less relevant ones, and (2) rank relevant (of any degree) documents high. With respect to (2), nDCG involves a discount component: a discount with regard to the relevance score of a relevant (of any degree) document is performed, and this discount is greater and greater, the higher the position the document has in the ranked list of retrieved documents. In the multiple degree relevance situation, the five combinations were evaluated under four different user scenarios, where each scenario simulated a certain user type. Again, the four alternative combinations outperformed the baseline, for each user scenario. The truncation combination had the best performance under each user scenario. This outcome agreed with the performance result in the binary relevance situation. However, there were also differences between the two relevance situations. For 25 percent of the topics, and with regard to one of the four user scenarios, the set of best performing combinations in the binary relevance situation was disjoint from the set of best performing combinations in the multiple degree relevance situation. The user scenario in question was such that almost all importance was placed on highly relevant documents, and the discount was sharp.
The main conclusion of the thesis is that normalization and right hand truncation (performed by a search expert) enhanced retrieval effectiveness in comparison to the baseline, irrespective of which of the two relevance situations we consider. Further, the three indexing strategy-query term combinations based on normalization were almost as good as the combination that involves truncation. This holds for both relevance situations.
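
As a worked example of the evaluation measure described above, the following sketch computes nDCG for a four-graded ranking. The discount shown (log base b, applied from rank b onward) is one common formulation of the measure and may differ in detail from the thesis's exact setup:

```python
import math

def dcg(relevances, base=2):
    """Discounted cumulated gain: later ranks are discounted by log(position)."""
    total = 0.0
    for i, rel in enumerate(relevances, start=1):
        discount = math.log(i, base) if i >= base else 1.0
        total += rel / discount
    return total

def ndcg(ranked, judged_pool, base=2):
    """Normalize by the DCG of the ideal (best possible) ranking."""
    ideal = sorted(judged_pool, reverse=True)[:len(ranked)]
    return dcg(ranked, base) / dcg(ideal, base)

# Four-graded judgements (0-3), as in the test collection described above.
print(round(ndcg([3, 0, 2, 1], [3, 2, 1, 0]), 3))  # -> 0.846
```
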


14

Tam, Wai I. "Compression, indexing and searching of a large structured-text database in a library monitoring and control system (LiMaCS)." Thesis, University of Macau, 1998. http://umaclib3.umac.mo/record=b1636991.

15

Tsatsaronis, George. "An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-202687.

Abstract:
This article provides an overview of the first BioASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BioASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies.
16

Tsatsaronis, George. "An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition." BioMed Central, 2015. https://tud.qucosa.de/id/qucosa%3A29496.

Abstract:
This article provides an overview of the first BioASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BioASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies.
17

Skeppstedt, Maria. "Extracting Clinical Findings from Swedish Health Record Text." Doctoral thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-109254.

Abstract:
Information contained in the free text of health records is useful for the immediate care of patients as well as for medical knowledge creation. Advances in clinical language processing have made it possible to automatically extract this information, but most research has, until recently, been conducted on clinical text written in English. In this thesis, however, information extraction from Swedish clinical corpora is explored, particularly focusing on the extraction of clinical findings. Unlike most previous studies, Clinical Finding was divided into the two more granular sub-categories Finding (symptom/result of a medical examination) and Disorder (condition with an underlying pathological process). For detecting clinical findings mentioned in Swedish health record text, a machine learning model, trained on a corpus of manually annotated text, achieved results in line with the obtained inter-annotator agreement figures. The machine learning approach clearly outperformed an approach based on vocabulary mapping, showing that Swedish medical vocabularies are not extensive enough for the purpose of high-quality information extraction from clinical text. A rule and cue vocabulary-based approach was, however, successful for negation and uncertainty classification of detected clinical findings. Methods for facilitating expansion of medical vocabulary resources are particularly important for Swedish and other languages with less extensive vocabulary resources. The possibility of using distributional semantics, in the form of Random indexing, for semi-automatic vocabulary expansion of medical vocabularies was, therefore, evaluated. Distributional semantics does not require that terms or abbreviations are explicitly defined in the text, and it is, thereby, a method suitable for clinical corpora. Random indexing was shown useful for extending vocabularies with medical terms, as well as for extracting medical synonyms and abbreviation dictionaries.
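
A minimal sketch of the Random indexing technique evaluated above: each term gets a sparse random "index vector", a term's context vector accumulates the index vectors of its neighbours, and synonym candidates are ranked by cosine similarity. The dimensionality, window size and the toy corpus below are illustrative assumptions:

```python
import math
import random
from collections import defaultdict

DIM, NONZERO, WINDOW = 100, 4, 2
random.seed(42)

def index_vector():
    """Sparse ternary random vector: a few +1/-1 entries, rest zero."""
    v = [0.0] * DIM
    for pos in random.sample(range(DIM), NONZERO):
        v[pos] = random.choice([-1.0, 1.0])
    return v

def build_context_vectors(tokens):
    index = defaultdict(index_vector)           # one index vector per term
    context = defaultdict(lambda: [0.0] * DIM)  # accumulated contexts
    for i, term in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                context[term] = [a + b
                                 for a, b in zip(context[term], index[tokens[j]])]
    return context

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

tokens = ("patient reports severe pain . patient reports mild ache . "
          "doctor notes severe pain .").split()
ctx = build_context_vectors(tokens)
# Terms sharing contexts get a clearly positive similarity.
print(round(cosine(ctx["pain"], ctx["ache"]), 2))
```
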
18

Tarczyńska, Anna. "Methods of Text Information Extraction in Digital Videos." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2656.

Abstract:
Context: The huge amount of existing digital video files needs to be indexed to make it available to customers (easier searching). The indexing can be provided by text information extraction. In this thesis we have analysed and compared methods of text information extraction in digital videos. Furthermore, we have evaluated them in the new context proposed by us, namely usefulness in sports news indexing and information retrieval.
Objectives: The objectives of this thesis are as follows: providing a better understanding of the nature of text extraction; performing a systematic literature review on various methods of text information extraction in digital videos of TV sports news; designing and executing an experiment in the testing environment; evaluating available and promising methods of text information extraction from digital video files in the proposed context associated with video sports news indexing and retrieval; providing an adequate solution in the proposed context described above.
Methods: This thesis employs three research methods: a Systematic Literature Review, Video Content Analysis with a checklist, and an Experiment. The Systematic Literature Review has been used to study the nature of text information extraction, to establish the methods and challenges, and to specify an effective way of conducting the experiment. The video content analysis has been used to establish the context for the experiment. Finally, the experiment has been conducted to answer the main research question: how useful are the methods of text information extraction for indexation of video sports news and information retrieval?
Results: Through the Systematic Literature Review we identified 29 challenges of the text information extraction methods, and 10 chains between them. We extracted 21 tools and 105 different methods, and analysed the relations between them. Through Video Content Analysis we specified three groups of probability of text extraction from video, and 14 categories for providing video sports news indexation with the taxonomy hierarchy. We conducted the Experiment on three video files, with 127 frames, 8,970 characters, and 1,814 words, using the only available tool, MoCA. As a result, we reported 10 errors and proposed recommendations for each of them. We evaluated the tool according to the categories mentioned above and identified four advantages and nine disadvantages of it.
Conclusions: It is hard to compare the methods described in the literature, because the tools are not available for testing, and they are not compared with each other. Furthermore, the values of the recall and precision measures highly depend on the quality of the text contained in the video. Therefore, performing the experiments on the same indexed database is necessary. However, text information extraction is time consuming (because of the huge number of frames in video), and even a high character recognition rate gives a low word recognition rate. Therefore, the usefulness of text information extraction for video indexation is still low. Because most of the text information contained in video news is inserted in post-processing, the text extraction could be provided at the source, during the processing of the original video by the broadcasting company (e.g. by automatically saving inserted text in a separate file). Then text information extraction would not be necessary for managing the new video files.
19

Hassel, Martin. "Resource Lean and Portable Automatic Text Summarization." Doctoral thesis, Stockholm: Numerisk analys och datalogi (Numerical Analysis and Computer Science), Kungliga Tekniska högskolan, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4414.

20

Toth, Róbert. "Přibližné vyhledávání řetězců v předzpracovaných dokumentech." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236122.

Abstract:
This thesis deals with the problem of approximate string matching, also called string matching allowing errors. The thesis targets the area of offline algorithms, which allow very fast pattern matching thanks to an index created during an initial text preprocessing phase. Initially, we will define the problem itself and demonstrate a variety of its applications, followed by a short survey of different approaches to this problem. Several existing algorithms based on suffix trees will be explained in detail, and a new hybrid algorithm will be proposed. The algorithms will be implemented in the C programming language and thoroughly compared in a series of experiments, with a focus on the newly presented algorithm.
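
For contrast with the index-based (offline) methods the thesis studies, the classical online dynamic-programming algorithm for approximate matching (often attributed to Sellers) is easy to sketch; suffix-tree indexes exist precisely to beat this O(mn) baseline. Note this is the baseline technique, not the thesis's hybrid algorithm:

```python
def approx_find(pattern, text, k):
    """All end positions where pattern matches text with <= k edit errors."""
    m = len(pattern)
    # column[i] = best edit distance of pattern[:i] against some substring
    # ending at the current text position; row 0 stays 0 because a match
    # may start anywhere in the text.
    column = list(range(m + 1))
    hits = []
    for pos, ch in enumerate(text):
        prev_diag = column[0]
        column[0] = 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            current = min(prev_diag + cost,   # substitution / match
                          column[i] + 1,      # deletion
                          column[i - 1] + 1)  # insertion
            prev_diag, column[i] = column[i], current
        if column[m] <= k:
            hits.append(pos)
    return hits

print(approx_find("survey", "surgery", 2))  # -> [4, 5, 6]
```
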
21

Zheng, Ning. "Discovering interpretable topics in free-style text: diagnostics, rare topics, and topic supervision." Columbus, Ohio: Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1199237529.

22

Weldeghebriel, Zemichael Fesahatsion. "Evaluating and comparing search engines in retrieving text information from the web." Thesis, Stellenbosch : Stellenbosch University, 2004. http://hdl.handle.net/10019.1/53740.

Abstract:
Thesis (MPhil)--Stellenbosch University, 2004
ENGLISH ABSTRACT: With the introduction of the Internet and the World Wide Web (www), information can be easily accessed and retrieved from the web using information retrieval systems such as web search engines or simply search engines. There are a number of search engines that have been developed to provide access to the resources available on the web and to help users in retrieving relevant information from the web. In particular, they are essential for finding text information on the web for academic purposes. But, how effective and efficient are those search engines in retrieving the most relevant text information from the web? Which of the search engines are more effective and efficient? So, this study was conducted to see how effective and efficient search engines are and to see which search engines are most effective and efficient in retrieving the required text information from the web. It is very important to know the most effective and efficient search engines because such search engines can be used to retrieve a higher number of the most relevant text web pages with minimum time and effort. The study was based on nine major search engines, four search queries and relevancy judgments as relevant/partly-relevant/non-relevant. Precision and recall were calculated based on the experimental or test results and these were used as basis for the statistical evaluation and comparisons of the retrieval effectiveness of the nine search engines. Duplicated items and broken links were also recorded and examined separately and were used as an additional measure of search engine effectiveness. A response time was also recorded and used as a base for the statistical evaluation and comparisons of the retrieval efficiency of the nine search engines. Additionally, since search engines involve indexing and searching in the information retrieval processes from the web, this study first discusses, from the theoretical point of view, how the indexing and searching processes are performed in an information retrieval environment. It also discusses the influences of indexing and searching processes on the effectiveness and efficiency of information retrieval systems in general and search engines in particular in retrieving the most relevant text information from the web.
23

Abeysinghe, Ruvini Pradeepa. "Signature Files for Document Management." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin990539054.

24

Henriksson, Aron. "Semantic Spaces of Clinical Text : Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records." Licentiate thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-94344.

Abstract:
The large amounts of clinical data generated by electronic health record systems are an underutilized resource, which, if tapped, has enormous potential to improve health care. Since the majority of this data is in the form of unstructured text, which is challenging to analyze computationally, there is a need for sophisticated clinical language processing methods. Unsupervised methods that exploit statistical properties of the data are particularly valuable due to the limited availability of annotated corpora in the clinical domain. Information extraction and natural language processing systems need to incorporate some knowledge of semantics. One approach exploits the distributional properties of language – more specifically, term co-occurrence information – to model the relative meaning of terms in high-dimensional vector space. Such methods have been used with success in a number of general language processing tasks; however, their application in the clinical domain has previously only been explored to a limited extent. By applying models of distributional semantics to clinical text, semantic spaces can be constructed in a completely unsupervised fashion. Semantic spaces of clinical text can then be utilized in a number of medically relevant applications. The application of distributional semantics in the clinical domain is here demonstrated in three use cases: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions. To apply distributional semantics effectively to a wide range of both general and, in particular, clinical language processing tasks, certain limitations or challenges need to be addressed, such as how to model the meaning of multiword terms and account for the function of negation: a simple means of incorporating paraphrasing and negation in a distributional semantic framework is here proposed and evaluated. The notion of ensembles of semantic spaces is also introduced; these are shown to outperform the use of a single semantic space on the synonym extraction task. This idea allows different models of distributional semantics, with different parameter configurations and induced from different corpora, to be combined. This is not least important in the clinical domain, as it allows potentially limited amounts of clinical data to be supplemented with data from other, more readily available sources. The importance of configuring the dimensionality of semantic spaces, particularly when – as is typically the case in the clinical domain – the vocabulary grows large, is also demonstrated.
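
The ensemble-of-semantic-spaces idea introduced above can be sketched independently of any particular model: rank synonym candidates in each space, then combine the ranks. The similarity tables below are toy stand-ins for spaces induced from different corpora or parameter configurations, and average-rank is just one plausible combination strategy:

```python
def rank_candidates(space, query, candidates):
    """Rank candidates by similarity to the query in one semantic space."""
    ordered = sorted(candidates, key=lambda c: -space[query][c])
    return {c: r for r, c in enumerate(ordered, start=1)}

def ensemble_rank(spaces, query, candidates):
    """Combine spaces by summed rank (lower is better)."""
    ranks = [rank_candidates(s, query, candidates) for s in spaces]
    return sorted(candidates, key=lambda c: sum(r[c] for r in ranks))

# Toy similarity tables: space_a["pain"]["ache"] = similarity score.
space_a = {"pain": {"ache": 0.9, "fever": 0.4, "cough": 0.2}}
space_b = {"pain": {"ache": 0.7, "fever": 0.8, "cough": 0.1}}

print(ensemble_rank([space_a, space_b], "pain", ["ache", "fever", "cough"]))
# -> ['ache', 'fever', 'cough']
```
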
High-Performance Data Mining for Drug Effect Detection (DADEL)
25

Valio, Felipe Braunger. "Detecção rápida de legendas em vídeos utilizando o ritmo visual." [s.n.], 2011. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275733.

Abstract:
Advisors: Neucimar Jerônimo Leite, Hélio Pedrini
Master's dissertation - Universidade Estadual de Campinas, Instituto de Computação
Abstract: Detection of text in images is a problem that has been studied for several decades. Many works extend the existing methods for use in video analysis; however, few of them create or adapt approaches that consider the inherent characteristics of video, such as temporal information. A particular problem of videos, which is the focus of this work, is the detection of subtitles. A fast method for locating video frames containing captions is proposed, based on a special data structure called the visual rhythm. The method is robust with respect to the alphabet used, font style, color intensity and subtitle orientation. Several datasets were used in our experiments to demonstrate the effectiveness of the method.
Master's degree in Computer Science
26

Civera, Saiz Jorge. "Novel statistical approaches to text classification, machine translation and computer-assisted translation." Doctoral thesis, Universitat Politècnica de València, 2008. http://hdl.handle.net/10251/2502.

Abstract:
This thesis presents several contributions in the fields of automatic text classification, machine translation and computer-assisted translation under the statistical framework. In automatic text classification, a new application called bilingual text classification is proposed, together with a series of models aimed at capturing such bilingual information. To this end, two approaches to this application are presented; the first is based on a naive assumption of independence between the two languages involved, while the second, more sophisticated, considers the existence of a correlation between words in different languages. The first approach gave rise to five models based on unigram models and smoothed n-gram models. These models were evaluated on three tasks of increasing complexity, the most complex of these tasks being analysed from the point of view of a system for assisting document indexing. The second approach is characterised by translation models capable of capturing correlation between words in different languages. In our case, the translation model chosen was the M1 model together with a unigram model. This model was evaluated on two of the simpler tasks, outperforming the naive approach, which assumes independence between words in different languages in bilingual texts. In machine translation, the word-based statistical translation models M1, M2 and HMM are extended within the mixture-modelling framework, with the aim of defining context-dependent translation models. Likewise, an iterative search algorithm based on dynamic programming, originally designed for the M2 model, is extended to the case of mixtures of M2 models. This search algorithm
Civera Saiz, J. (2008). Novel statistical approaches to text classification, machine translation and computer-assisted translation [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2502
27

Vasireddy, Jhansi Lakshmi. "Applications of Linear Algebra to Information Retrieval." Digital Archive @ GSU, 2009. http://digitalarchive.gsu.edu/math_theses/71.

Abstract:
Some of the theory of nonnegative matrices is first presented. The Perron-Frobenius theorem is highlighted. Some of the important linear algebraic methods of information retrieval are surveyed. Latent Semantic Indexing (LSI), which uses the singular value decomposition, is discussed. The Hyperlink-Induced Topic Search (HITS) algorithm is next considered; here the power method for finding dominant eigenvectors is employed. Through the use of a theorem by Sinkhorn and Knopp, a modified HITS method is developed. Lastly, the PageRank algorithm is discussed. Numerical examples and MATLAB programs are also provided.
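
As an illustration of the power method mentioned above, this NumPy sketch iterates hub and authority scores on a tiny link graph (the plain HITS algorithm; the graph and iteration count are illustrative, and the thesis's MATLAB programs are the authoritative versions):

```python
import numpy as np

def hits(adjacency, iterations=50):
    """Power iteration: authorities ~ dominant eigenvector of A^T A."""
    a = np.ones(adjacency.shape[1])  # authority scores
    h = np.ones(adjacency.shape[0])  # hub scores
    for _ in range(iterations):
        a = adjacency.T @ h
        a /= np.linalg.norm(a)
        h = adjacency @ a
        h /= np.linalg.norm(h)
    return h, a

# Tiny web graph: page 0 links to 1 and 2; page 1 links to 2.
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [0., 0., 0.]])
hubs, authorities = hits(A)
print(authorities.round(2))  # page 2 gets the largest authority score
```
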
28

SILVA, Israel Batista Freitas da. "Representações cache eficientes para índices baseados em Wavelet trees." Universidade Federal de Pernambuco, 2016. https://repositorio.ufpe.br/handle/123456789/21050.

Abstract:
CNPQ, FACEPE.
Today, there is an exponential growth in the volume of information in the world. This increase creates the demand for more efficient indexing and querying techniques, since, to be useful, that data needs to be manageable. Pattern matching means searching for a string (pattern) in a much bigger string (text), reporting the number of occurrences and/or their locations. To do that, we need to build a data structure known as an index. This structure preprocesses the text to allow for efficient queries. The adoption of an index depends heavily on its efficiency, and this is directly related to how well it performs on current machine architectures. The main objective of this work is to analyze the Wavelet Tree data structure as an index, assessing the impact of its internal organization with respect to spatial locality, and propose ways to organize its data so as to reduce the amount of cache misses incurred by its operations. We performed an empirical analysis using both real and simulated textual data to compare the running time and cache behavior of Wavelet Trees using five different proposals of internal data layout. A theoretical analysis of the cache complexity of a query operation is also presented for the most efficient layout. Two experiments suggest good asymptotic behavior for two of the analyzed layouts. A third experiment shows that for four of the five layouts, there was a systematic reduction in the number of cache misses for the lowest-level cache. Despite this, the reduction was not reflected in the runtime, nor in the performance for the highest-level cache. The results obtained allow us to conclude that the choice of a suitable layout can lead to a significant improvement in cache usage. Unlike the theoretical model, however, the cost of memory access only accounts for a fraction of the operations' computation time on the Wavelet Trees, so the decrease in the number of cache misses did not translate fully into gains in the execution time. However, this factor can still be critical in more extreme memory utilization situations.
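
For readers unfamiliar with the structure, a minimal pointer-based wavelet tree supporting rank queries is sketched below. Real implementations, including the layouts compared in the dissertation, replace these Python lists with carefully laid-out bit vectors, which is exactly where cache behaviour enters:

```python
class WaveletTree:
    """Didactic wavelet tree; queries assume the symbol is in the alphabet."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:
            self.leaf, self.n = True, len(seq)
            return
        self.leaf = False
        mid = len(self.alphabet) // 2
        self.left_set = set(self.alphabet[:mid])
        # bitmap[i] = 1 if seq[i] is routed to the right child
        self.bitmap = [0 if c in self.left_set else 1 for c in seq]
        # prefix counts of ones, so rank over the bitmap is O(1) here
        self.ones = [0]
        for b in self.bitmap:
            self.ones.append(self.ones[-1] + b)
        self.left = WaveletTree([c for c in seq if c in self.left_set],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_set],
                                 self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of symbol c in the first i positions."""
        if self.leaf:
            return min(i, self.n)
        ones = self.ones[i]
        if c in self.left_set:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)

wt = WaveletTree("abracadabra")
print(wt.rank("a", 8))  # occurrences of 'a' in "abracada" -> 4
```
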
29

Puigcerver i Pérez, Joan. "A Probabilistic Formulation of Keyword Spotting." Doctoral thesis, Universitat Politècnica de València, 2019. http://hdl.handle.net/10251/116834.

Abstract:
Keyword Spotting, applied to handwritten text documents, aims to retrieve the documents, or parts of them, that are relevant for a query, given by the user, within a large collection of documents. The topic has gained a large interest in the last 20 years among Pattern Recognition researchers, as well as digital libraries and archives. This thesis first defines the goal of Keyword Spotting from a Decision Theory perspective. Then, the problem is tackled following a probabilistic formulation. More precisely, Keyword Spotting is presented as a particular instance of Information Retrieval, where the content of the documents is unknown, but can be modeled by a probability distribution. In addition, the thesis also proves that, under the correct probability distributions, the framework provides the optimal solution, under many of the evaluation measures traditionally used in the field. Later, different statistical models are used to represent the probability distribution over the content of the documents. These models, Hidden Markov Models or Recurrent Neural Networks, are estimated from training data, and the corresponding distributions over the transcripts of the images can be efficiently represented using Weighted Finite State Transducers. In order to make the framework practical for large collections of documents, this thesis presents several algorithms to build probabilistic word indexes, using both lexicon-based and lexicon-free models. These indexes are very similar to the ones used by traditional search engines. Furthermore, we study the relationship between the presented formulation and other seminal approaches in the field of Keyword Spotting, highlighting some limitations of the latter. Finally, all the contributions are evaluated experimentally, not only on standard academic benchmarks, but also on collections including tens of thousands of pages of historical manuscripts. The results show that the proposed framework and algorithms allow very accurate and very fast Keyword Spotting systems to be built, with a solid underlying theory.
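
A tiny instance of the probabilistic formulation: treat the unknown transcript as a distribution over hypotheses (here a toy n-best list, standing in for the weighted-transducer representations the thesis actually uses) and score a keyword by the total probability of the hypotheses that contain it:

```python
def keyword_score(nbest, keyword):
    """P(keyword occurs in the image) under the hypothesis distribution."""
    return sum(p for transcript, p in nbest if keyword in transcript.split())

# Hypothetical n-best output of a handwritten-text recognizer.
nbest = [("the quick brown fox", 0.6),
         ("the quick brown box", 0.3),
         ("the quack brown fox", 0.1)]

print(round(keyword_score(nbest, "fox"), 2))  # -> 0.7
```
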
Puigcerver I Pérez, J. (2018). A Probabilistic Formulation of Keyword Spotting [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/116834
APA, Harvard, Vancouver, ISO, and other styles
30

Zougris, Konstantinos. "Sociological Applications of Topic Extraction Techniques: Two Case Studies." Thesis, University of North Texas, 2015. https://digital.library.unt.edu/ark:/67531/metadc804982/.

Full text
Abstract:
Limited research has been conducted on the applicability of topic extraction techniques in Sociology. Addressing modern methodological opportunities, and responding to skepticism about the absence of theoretical foundations supporting the use of text analytics, I argue that Latent Semantic Analysis (LSA), complemented by other text analysis techniques and multivariate techniques, can constitute a unique hybrid method that facilitates the sociological interpretation of web-based textual data. To illustrate the applicability of the hybrid technique, I developed two case studies. My first case study is associated with the Sociology of media. It focuses on topic extraction and sentiment polarization among partisan texts posted on two major news sites. I find evidence of highly polarized opinions in comments posted on the Huffington Post and the Daily Caller. The most polarizing topic was associated with a commentator's reference to hoodies in the context of the Trayvon Martin incident. My findings support contemporary research suggesting that media pundits frequently use tactics of outrage to provoke polarization of public opinion. My second case study contributes to the research domain of the Sociology of knowledge. The hybrid method revealed evidence of topical divides and topical "bridges" in the intellectual landscape of the British and American sociological journals. My findings confirm theoretical assertions describing Sociology as a fractured field, and partially support the existence of more globalized topics in the discipline.
APA, Harvard, Vancouver, ISO, and other styles
31

Moens, Marie-Francine. "Automatic indexing and abstracting of document texts /." Boston, Mass. [u.a.] : Kluwer Academic Publ, 2000. http://www.loc.gov/catdir/enhancements/fy0820/00020394-d.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Dang, Quoc Bao. "Information spotting in huge repositories of scanned document images." Thesis, La Rochelle, 2018. http://www.theses.fr/2018LAROS024/document.

Full text
Abstract:
This work aims at developing a generic framework able to produce camera-based applications for information spotting in huge repositories of heterogeneous-content document images via local descriptors. The targeted systems take as query a portion of an image acquired by a camera and return the focused portions of the database images that best match it. We first propose a set of generic feature descriptors for camera-based document image retrieval and spotting systems. The proposed descriptors, SRIF, PSRIF, DELTRIF and SSKSRIF, are built from the spatial organization of the nearest keypoints around a pivot keypoint, where all keypoints are extracted from the centroids of connected components and invariant geometrical features are computed over them. SRIF and PSRIF are computed from a local set of the m nearest keypoints around a keypoint, while DELTRIF and SSKSRIF obtain a parameter-free local shape description via a Delaunay triangulation formed from the keypoints extracted from a document image. Furthermore, we propose a framework to compute descriptors based on the spatial organization of dedicated keypoints, e.g. SURF, SIFT or ORB, so that they can handle heterogeneous-content camera-based document image retrieval and spotting. In practice, a large-scale indexing system with an enormous number of descriptors places a heavy burden on memory, and the high dimensionality of the descriptors can reduce indexing accuracy. We therefore propose three robust indexing frameworks that can be employed without storing local descriptors in memory, saving memory and speeding up retrieval by discarding distance validation. The randomized clustering tree index inherits from kd-trees, k-means trees and random forests the way it randomly selects, at each node of the tree, K dimensions combined with the highest-variance dimension; we also propose a weighted Euclidean distance between two data points oriented along the highest-variance dimension. The second proposed framework relies on a single hash table for indexing and retrieving without storing database descriptors, and we further propose an extended hashing-based method for indexing multiple kinds of features coming from multiple layers of the image. Alongside the proposed descriptors and indexing frameworks, we propose a simple and robust way to compute the shape orientation of MSER regions so that they can be combined rotation-invariantly with dedicated descriptors (e.g. SIFT, SURF or ORB). Where descriptors are able to capture neighborhood information around MSER regions, we propose extending the regions by increasing the radius of each region; this strategy can also be applied to other detected regions in order to make descriptors more distinctive. Finally, in order to assess the performance of our contributions, and given that no public dataset existed for camera-based document image retrieval and spotting systems, we built a new dataset, made freely and publicly available to the scientific community. It contains portions of document images acquired via a camera as queries, and is composed of three kinds of information: textual content, graphical content and heterogeneous content.
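As a rough illustration of indexing without storing descriptors, the following sketch uses a single LSH-style hash table that keeps only document ids in its buckets and ranks candidates by votes; the hashing scheme and all names are assumptions standing in for the thesis's frameworks, not their exact algorithms:

```python
import numpy as np

def hash_key(desc, planes):
    """LSH-style key: sign pattern of the descriptor against random hyperplanes."""
    return tuple((desc @ planes > 0).astype(np.uint8))

class DescriptorHashIndex:
    """Single hash table whose buckets hold only document ids, never the
    descriptors themselves; candidates are ranked by votes, with no
    distance validation against stored vectors."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((dim, n_bits))
        self.table = {}

    def add(self, doc_id, descriptors):
        for d in descriptors:
            self.table.setdefault(hash_key(d, self.planes), []).append(doc_id)

    def query(self, descriptors, top_k=5):
        votes = {}
        for d in descriptors:
            for doc_id in self.table.get(hash_key(d, self.planes), ()):
                votes[doc_id] = votes.get(doc_id, 0) + 1
        return sorted(votes.items(), key=lambda kv: -kv[1])[:top_k]

rng = np.random.default_rng(1)
db = rng.standard_normal((100, 64))
index = DescriptorHashIndex(dim=64)
index.add("doc1", db[:50])
index.add("doc2", db[50:])
# Query with slightly perturbed copies of doc1's descriptors.
print(index.query(db[:5] + 0.01 * rng.standard_normal((5, 64))))
```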
APA, Harvard, Vancouver, ISO, and other styles
33

Gzawi, Mahmoud. "Désambiguïsation de l’arabe écrit et interprétation sémantique." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE2006.

Full text
Abstract:
This thesis lies at the intersection of linguistic research and natural language processing. The two fields meet in the construction of text processing tools and in industrial applications integrating solutions for the disambiguation and interpretation of texts. A challenging and rarely addressed task arose in the work of the Techlimed company: the automatic analysis of texts written in Arabic. New resources such as language lexicons and semantic networks have emerged, allowing formal grammars to be created for this task. An important piece of metadata for text analysis is "what is being said, and what does it mean?". The field of computational linguistics offers very diverse and mostly partial methods to allow the computer to answer such questions. The main purpose of this thesis is to introduce and apply the rules of descriptive language grammar in formal languages specific to computer language processing. Beyond the realization of a system for processing and interpreting Arabic texts based on computational modeling, our interest was devoted to evaluating the linguistic phenomena described in the literature and the methods for formalizing them in computer science. In all cases, our research was tested and validated in a rigorous experimental framework around several formalisms and computer tools. The experiments concerning the contribution of a syntactico-semantic grammar demonstrated, a priori, a significant reduction of linguistic ambiguity when using a finite-state grammar written in Java and a transformational generative grammar written in Prolog, integrating morphological, syntactic and semantic components. Our study required the construction of text processing and information retrieval tools, which we built ourselves and released as open source. Applying our work at scale ultimately depends on having rich and comprehensive semantic resources, so our work was redirected towards producing such resources through information retrieval and knowledge extraction. The tests conducted for this new perspective were favorable to further research and experimentation.
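A toy sketch of how a finite-state grammar can reduce lexical ambiguity; the tag set, transition table and names are invented for illustration, and the thesis's grammars are far richer:

```python
# Each token has a set of candidate part-of-speech tags (lexical ambiguity);
# a finite-state grammar keeps only tag sequences the automaton accepts.
from itertools import product

TRANSITIONS = {("START", "DET"): "NP", ("NP", "NOUN"): "S",
               ("S", "VERB"): "VP", ("START", "NOUN"): "S"}

def accepted(tags):
    state = "START"
    for t in tags:
        state = TRANSITIONS.get((state, t))
        if state is None:
            return False
    return True

def disambiguate(candidates):
    """Return tag sequences licensed by the grammar; ambiguity shrinks from
    the full cartesian product to the accepted paths only."""
    return [seq for seq in product(*candidates) if accepted(seq)]

# 'book' could be NOUN or VERB; the automaton rules out DET VERB.
print(disambiguate([{"DET"}, {"NOUN", "VERB"}]))  # [('DET', 'NOUN')]
```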
APA, Harvard, Vancouver, ISO, and other styles
34

Pohlídal, Antonín. "Inteligentní emailová schránka." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236458.

Full text
Abstract:
This master's thesis deals with the use of text classification for sorting incoming emails. First, Knowledge Discovery in Databases is described, and text classification with selected methods is analyzed in detail. The thesis then describes email communication and the SMTP, POP3 and IMAP protocols. The next part contains the design of a system that classifies incoming emails and describes the related technologies, i.e. Apache James Server, PostgreSQL and RapidMiner. The implementation of all necessary components is then described. The last part contains experiments with the email server using the Enron dataset.
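A minimal sketch of the classification step, using scikit-learn as a stand-in for the RapidMiner components described above; the toy corpus, labels and folder names are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; a real system would train on labeled mailboxes
# such as the Enron dataset mentioned above.
emails = ["meeting moved to 3pm tomorrow",
          "cheap pills limited offer click now",
          "quarterly report attached for review",
          "you won a prize claim your money"]
folders = ["work", "spam", "work", "spam"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(emails, folders)
print(clf.predict(["please review the attached report"]))  # ['work']
```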
APA, Harvard, Vancouver, ISO, and other styles
35

Wu, Zimin. "A partial syntactic analysis-based pre-processor for automatic indexing and retrieval of Chinese texts." Thesis, Loughborough University, 1992. https://dspace.lboro.ac.uk/2134/13685.

Full text
Abstract:
Automatic indexing is the automatic creation of a text surrogate, normally keywords or phrases, to represent the original text. In current English text retrieval systems, this process of content representation is accomplished by extracting words using spaces and punctuation as word delimiters. The same technique cannot easily be applied to Chinese texts, which contain no obvious word boundaries; they appear as a linear sequence of non-spaced or equally spaced ideographic characters, and the number of characters in words varies. The solution to the problem lies in morphological and syntactic analyses of Chinese morphemes, words and phrases. The idea is inspired by experiments on English computational morphology and its application to English text retrieval, mainly automatic compound and phrase indexing. These areas are particularly germane to Chinese because typographically there are no morph and phrase boundaries in either Chinese or English texts. The experiment is based on the hypothesis that words and phrases exceeding two Chinese characters can be characterised by a grammar that describes the concatenation behaviour of morphological and syntactic categories. This is examined using the following three procedures: (1) text segmentation - texts are divided into one- and two-character segments by searching a dictionary containing over 17000 morphemes and words, which are tagged with morphological and syntactic categories; (2) category disambiguation - for the resulting morphemes and words tagged with more than one category, the correct one is selected based on context; (3) parsing - the segments are analysed using the grammar, which combines them into compound and complex words and phrases for indexing and retrieval. The utilities employed in the experiment include CCDOS, an extended version of MS-DOS providing a Chinese I/O system, Chinese WordStar for text input and Chinese dBASE III for dictionary construction. Source code is written in Turbo BASIC, including its database toolbox. Thirty texts are drawn randomly from newspapers to form the sample for the experiment. The results prove that the partial syntactic analysis-based approach can extract keywords with a good degree of accuracy.
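A hedged sketch of dictionary-based segmentation, the first of the three procedures; this greedy forward maximum-matching variant is a common simplification, not the thesis's exact method, and the toy lexicon is invented:

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry; unmatched characters become single-character segments."""
    segments, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                segments.append(text[i:j])
                i = j
                break
    return segments

lexicon = {"中文", "信息", "检索", "系统"}
print(max_match("中文信息检索系统", lexicon))  # ['中文', '信息', '检索', '系统']
```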
APA, Harvard, Vancouver, ISO, and other styles
36

Balgar, Marek. "Vyhledávání informací v české Wikipedii." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-412831.

Full text
Abstract:
The main task of this master's thesis is to understand problems of information retrieval and text classification. The research focuses on text data, semantic dictionaries and especially knowledge inferred from Wikipedia. The thesis also describes the implementation of a querying system based on the acquired knowledge. Finally, the properties and possible improvements of the system are discussed.
APA, Harvard, Vancouver, ISO, and other styles
37

ALVES, George Marcelo Rodrigues. "RISO - GCT - Determinação do contexto temporal de conceitos em textos." Universidade Federal de Campina Grande, 2016. http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/469.

Full text
Abstract:
Due to the constant growth of the number of texts available on the Web, there is a need to catalog the information that appears at every moment. This is an arduous task that humans are unable to perform manually, given the amount of data made available every second. Numerous studies have been conducted in order to automate the cataloging process. A research line useful to various areas of human knowledge is the indexing of documents based on the temporal contexts present in them. This is not a trivial task, as it involves analyzing unstructured information expressed in natural language and available in several languages, among other difficulties. The main objective of this work is to create a model that allows the indexing of documents, creating topic maps enriched with the concepts in the text and their related temporal information. This approach led to RISO-GCT (Temporal Contexts Generation), a part of the RISO Project (Semantic Information Retrieval on Text Objects), which aims to create an environment for semantic indexing and retrieval of documents, enabling more accurate retrieval. RISO-GCT uses the results of a preliminary module, RISO-TT (Temporal Tagger), responsible for tagging the temporal information contained in documents and for normalizing the temporal expressions found; in this work, the normalization of temporal expressions was improved in order to allow a richer determination of temporal contexts. Experiments were conducted to evaluate the effectiveness of the proposed approach: the first verified whether the topic map previously created by RISO-IC (Conceptual Indexing) was correctly enriched with the temporal information related to the concepts, and the second analyzed the effectiveness of the normalization of temporal expressions extracted from documents. The experiments showed that both RISO-GCT and the improved RISO-TT obtained better results than comparable tools.
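A small sketch of the normalization step for one pattern of temporal expression, assuming ISO 8601 as the target form; real taggers such as RISO-TT handle far more expression types, including relative ones:

```python
import re

MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5,
          "june": 6, "july": 7, "august": 8, "september": 9,
          "october": 10, "november": 11, "december": 12}

def normalize_dates(text):
    """Map expressions like 'February 26, 2016' to ISO 8601 ('2016-02-26').
    Resolving relative expressions ('last Monday') against a reference
    date is omitted here."""
    pattern = re.compile(r"(%s)\s+(\d{1,2}),\s*(\d{4})" % "|".join(MONTHS),
                         re.IGNORECASE)
    def repl(m):
        return "%04d-%02d-%02d" % (int(m.group(3)),
                                   MONTHS[m.group(1).lower()],
                                   int(m.group(2)))
    return pattern.sub(repl, text)

print(normalize_dates("Defended on February 26, 2016."))
# Defended on 2016-02-26.
```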
APA, Harvard, Vancouver, ISO, and other styles
38

Wang, Juo-Wen, and 汪若文. "Automatic Classification of Text Documents by Using Latent Semantic Indexing." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/09421240911724157604.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Information Management Group, In-service Master's Program, College of Management
ROC year 92 (2003)
Search and browsing are both important tasks in information retrieval. Search provides a way to find information rapidly, but relying on words makes it hard to deal with the problems of synonymy and polysemy; besides, users sometimes cannot provide a suitable query and cannot find the information they really need. To provide good information services, browsing through a good classification mechanism is as important as search. There are two steps in classifying documents: the first is to represent documents in a suitable mathematical form; the second is to classify them automatically using suitable classification algorithms. Classification is a task of conceptualization, and representing documents in the conventional vector space model cannot avoid relying explicitly on words. Latent semantic indexing (LSI) was developed to capture the semantic concepts of documents, which may make it suitable for document classification. This thesis studies the feasibility and effect of classifying text documents using LSI as the document representation, with both centroid vectors and k-NN as the classification algorithms, and compares the results to those of the vector space model. The study deals with the problem of one-category classification. The results show that automatic classification of text documents using LSI along with suitable classification algorithms is feasible, but the classification accuracy with LSI is not as good as with the vector space model. The effect of applying LSI to multi-category classification and of combining LSI with other classification algorithms needs further study.
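A compact sketch of the LSI-plus-k-NN pipeline studied in the thesis, using scikit-learn components as stand-ins; the corpus, labels and dimensionality are toy assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["stock markets fell sharply today",
        "the team won the championship game",
        "central bank raises interest rates",
        "striker scores twice in the final"]
labels = ["finance", "sports", "finance", "sports"]

# LSI: project TF-IDF vectors onto the top singular vectors, then
# classify in the reduced "semantic" space with k-NN.
lsi_knn = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2),
                        KNeighborsClassifier(n_neighbors=1))
lsi_knn.fit(docs, labels)
print(lsi_knn.predict(["bank lowers rates amid market turmoil"]))
```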
APA, Harvard, Vancouver, ISO, and other styles
39

Golynski, Alexander. "Upper and Lower Bounds for Text Upper and Lower Bounds for Text Indexing Data Structures." Thesis, 2007. http://hdl.handle.net/10012/3509.

Full text
Abstract:
The main goal of this thesis is to investigate the complexity of a variety of problems related to text indexing and text searching. We present new data structures that can be used as building blocks for full-text indices that occupy very little space (FM-indexes) and wavelet trees. These data structures can also be used to represent labeled trees and posting lists; labeled trees are applied in XML documents, and posting lists in search engines. The main emphasis of this thesis is on lower bounds for time-space tradeoffs for the following problems: the rank/select problem, the problem of representing a string of balanced parentheses, the text retrieval problem, the problem of computing a permutation and its inverse, and the problem of representing a binary relation. These results are divided in two groups: lower bounds in the cell probe model and lower bounds in the indexing model. The cell probe model is the most natural and widely accepted framework for studying data structures; in this model, we are concerned with the total space used by a data structure and the total number of accesses (probes) it performs to memory, while computation is free of charge. The indexing model imposes an additional restriction on the storage: the object in question must be stored in its raw form together with a small index that facilitates an efficient implementation of a given set of queries, e.g. finding rank, select, a matching parenthesis, or an occurrence of a given pattern in a given text (for the text retrieval problem). We propose a new technique for proving lower bounds in the indexing model and use it to obtain lower bounds for the rank/select problem and the balanced parentheses problem. We also improve the existing techniques of Demaine and Lopez-Ortiz using compression and present stronger lower bounds for the text retrieval problem in the indexing model. The most important result of this thesis is a new technique for cell probe lower bounds. We demonstrate its strength by proving new lower bounds for the problem of representing permutations, the text retrieval problem, and the problem of representing binary relations. (Previously, no non-trivial results were known for these problems.) In addition, we note that the lower bounds for the permutations problem and the binary relations problem are tight for a wide range of parameters, e.g. the running time of queries and the size and density of the relation.
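For readers unfamiliar with the rank/select problem, here is a naive Python sketch of the query semantics; succinct structures answer both queries in O(1) using o(n) extra bits, whereas this version only shows what the queries compute:

```python
class RankSelectBitVector:
    """Naive rank/select over a bit list with block-level prefix sums.
    Succinct structures achieve O(1) queries with o(n) extra bits; this
    sketch only demonstrates the query semantics."""
    def __init__(self, bits, block=64):
        self.bits, self.block = bits, block
        self.prefix = [0]
        for i in range(0, len(bits), block):
            self.prefix.append(self.prefix[-1] + sum(bits[i:i + block]))

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        b = i // self.block
        return self.prefix[b] + sum(self.bits[b * self.block:i])

    def select1(self, k):
        """Position of the k-th 1 (1-based); linear scan for simplicity."""
        count = 0
        for pos, bit in enumerate(self.bits):
            count += bit
            if count == k:
                return pos
        raise ValueError("fewer than k ones in the bit vector")

bv = RankSelectBitVector([1, 0, 1, 1, 0, 0, 1])
print(bv.rank1(4), bv.select1(3))  # 3 3
```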
APA, Harvard, Vancouver, ISO, and other styles
40

Maaß, Moritz G. [Verfasser]. "Analysis of algorithms and data structures for text indexing / Moritz G. Maaß." 2006. http://d-nb.info/985174366/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Chang, Yu-Jen, and 張佑任. "A Research of Performance Evaluation of Mandarin Chinese Full-Text Information Retrieval--Full-Text Scan Model vs. Cluster Indexing Model." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/31589747547714555995.

Full text
Abstract:
Master's thesis
Nanhua University
Master's Program, Department of Information Management
ROC year 90 (2001)
Full-text information retrieval is becoming an interdisciplinary interest. Mandarin Chinese full-text retrieval faces more basic difficulties than English because of a research lag and the nature of the language, and the lack of an objective test collection and a standard effectiveness evaluation for information retrieval experiments is the fundamental issue. In this thesis, we introduce two different systems, the Chinese Text Processor (CTP) developed by Academia Sinica in 1996 and the Cluster Indexing Model (CIM) developed by Huang Yun-Long in 1997, and evaluate their performance on the same corpus (document set). Given the state of research in Chinese, this work makes three contributions: first, it analyzes which full-text retrieval method best fits a given corpus; second, it develops a mature Cluster Indexing Model as a foundation for further applied research; finally, it constructs test collections and a standard effectiveness evaluation for Chinese full-text retrieval research. The collection involves medicine-related articles from the Children's Daily News (502 documents) and 21 queries. A series of experiments leads to the following findings: 1. The average recall of CTP is 99.02%, and its average precision is 17.72%. 2. With automatic term segmentation, under index dimension 100 and similarity threshold 0.3: (1) the recall of CIM-IDF is 80.73% and its precision is 45.09%; (2) the recall of CIM-TF is 65.97% and its precision is 43.52%. 3. With manual term segmentation, under index dimension 100 and similarity threshold 0.3: (1) the recall of CIM-IDF is 82.81% and its precision is 47.11%; (2) the recall of CIM-TF is 64.81% and its precision is 42.72%. 4. According to these results: (1) CIM-IDF outperforms CTP with both automatic and manual term segmentation; (2) CIM-IDF outperforms CIM-TF with both automatic and manual term segmentation; (3) in CIM-IDF, when the index dimension exceeds 80, automatic and manual term segmentation perform similarly, which shows clearly that automatic term segmentation can substitute for manual segmentation. Researchers have long worked on information retrieval systems, drawing on different theories to improve performance, but no single system satisfies all needs; future IR systems should support different retrieval models, and relevance feedback could be applied across models. Moreover, although research has addressed many topics in Mandarin Chinese full-text retrieval, it has lacked effectiveness evaluation across diverse retrieval settings; constructing a standard evaluation environment (e.g. a large corpus, queries, relevance judgments and a standard evaluation methodology) would contribute to improving system performance.
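The evaluation measures used above are the standard set-based recall and precision; a small sketch follows, with toy numbers chosen to echo the CIM-IDF figures:

```python
def recall_precision(retrieved, relevant):
    """Set-based recall and precision as used in the evaluation above."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Toy check: 9 of 10 relevant docs found among 20 retrieved.
r, p = recall_precision(retrieved=range(20), relevant=list(range(9)) + [99])
print(f"recall={r:.2%} precision={p:.2%}")  # recall=90.00% precision=45.00%
```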
APA, Harvard, Vancouver, ISO, and other styles
42

Murfi, Hendri [Verfasser]. "Machine learning for text indexing : concept extraction, keyword extraction and tag recommendation / vorgelegt von Hendri Murfi." 2010. http://d-nb.info/1009119486/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Qiu, Jun-Feng, and 邱俊逢. "Implementation of Web-based Files Management System and Network Spy Agent With Full-Text Indexing Capability." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/94513863253063555474.

Full text
Abstract:
Master's thesis
National Kaohsiung First University of Science and Technology
Institute of Computer and Communication Engineering
ROC year 91 (2002)
On the Internet, FTP servers have been widely adopted for a variety of purposes. In file management especially, as the number of stored files increases, an FTP server that cannot support keyword search becomes hard to manage. In this study, we propose a web-based file server that uses ActiveX Control technology to implement file management in the browser, and we use Microsoft's Index Server to build a full-text retrieval function so that users can search files by keyword. The system employs ASP, JavaScript, CSS, ADSI, ActiveX, SQL, VB, the Windows API, Index Server, IIS, and so on. In addition, we design a network spy agent server with the above techniques; the agent searches FTP server files for the user's keywords, saving manpower and time. After the agent finds matching files, they are saved on our file management server, where the user can decide whether to keep them.
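A minimal sketch of the full-text keyword search role that Index Server plays in the system, written as a plain inverted index over a directory tree; the paths and behavior are illustrative assumptions, not the Index Server API:

```python
import os, re
from collections import defaultdict

def build_index(root):
    """Minimal full-text inverted index over a directory tree, standing in
    for the keyword-search role Index Server plays in the system above."""
    index = defaultdict(set)
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for word in re.findall(r"\w+", f.read().lower()):
                        index[word].add(path)
            except OSError:
                continue
    return index

def search(index, query):
    """AND-query: files containing every keyword."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for t in terms[1:]:
        result &= index.get(t, set())
    return result
```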
APA, Harvard, Vancouver, ISO, and other styles
44

Huang, Yun-Long, and 黃雲龍. "A Theoretic Research of Cluster Indexing for Mandarin Chinese Full Text Document--The Construction of Vector Space Model." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/31705905316420373533.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Gupta, Ankur. "Succinct Data Structures." Diss., 2007. http://hdl.handle.net/10161/434.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

"Automatic index generation for the free-text based database." Chinese University of Hong Kong, 1992. http://library.cuhk.edu.hk/record=b5887040.

Full text
Abstract:
by Leung Chi Hong.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1992.
Includes bibliographical references (leaves 183-184).
Chapter one: Introduction
Chapter two: Background knowledge and linguistic approaches of automatic indexing
  2.1 Definition of index and indexing
  2.2 Indexing methods and problems
  2.3 Automatic indexing and human indexing
  2.4 Different approaches of automatic indexing
  2.5 Example of semantic approach
  2.6 Example of syntactic approach
  2.7 Comments on semantic and syntactic approaches
Chapter three: Rationale and methodology of automatic index generation
  3.1 Problems caused by natural language
  3.2 Usage of word frequencies
  3.3 Brief description of rationale
  3.4 Automatic index generation
    3.4.1 Training phase
      3.4.1.1 Selection of training documents
      3.4.1.2 Control and standardization of variants of words
      3.4.1.3 Calculation of associations between words and indexes
      3.4.1.4 Discarding false associations
    3.4.2 Indexing phase
    3.4.3 Example of automatic indexing
  3.5 Related researches
  3.6 Word diversity and its effect on automatic indexing
  3.7 Factors affecting performance of automatic indexing
  3.8 Application of semantic representation
    3.8.1 Problem of natural language
    3.8.2 Use of concept headings
    3.8.3 Example of using concept headings in automatic indexing
    3.8.4 Advantages of concept headings
    3.8.5 Disadvantages of concept headings
  3.9 Correctness prediction for proposed indexes
    3.9.1 Example of using index proposing rate
  3.10 Effect of subject matter on automatic indexing
  3.11 Comparison with other indexing methods
  3.12 Proposal for applying Chinese medical knowledge
Chapter four: Simulations of automatic index generation
  4.1 Training phase simulations
    4.1.1 Simulation of association calculation (word diversity uncontrolled)
    4.1.2 Simulation of association calculation (word diversity controlled)
    4.1.3 Simulation of discarding false associations
  4.2 Indexing phase simulation
  4.3 Simulation of using concept headings
  4.4 Simulation for testing performance of predicting index correctness
  4.5 Summary
Chapter five: Real case study in the database of the Chinese Medicinal Material Research Center
  5.1 Selection of real documents
  5.2 Case study one: Overall performance using real data
    5.2.1 Sample results of automatic indexing for real documents
  5.3 Case study two: Using multi-word terms
  5.4 Case study three: Using concept headings
  5.5 Case study four: Prediction of proposed index correctness
  5.6 Case study five: Use of (Σ ΔRij)·Fi to determine false association
  5.7 Case study six: Effect of word diversity
  5.8 Summary
Chapter six: Conclusion
Appendix A: List of stopwords
Appendix B: Index terms used in case studies
References
APA, Harvard, Vancouver, ISO, and other styles
47

Tomeš, Jiří. "Indexace elektronických dokumentů a jejich částí." Master's thesis, 2015. http://www.nusl.cz/ntk/nusl-352314.

Full text
Abstract:
The thesis describes the design and implementation of an application for processing electronic publications (collections of conference papers, comprehensive manuals, or even classical electronic books) in order to enrich their internal navigation with hyperlinks between related parts, and to produce summarizations of a given length that are as representative as possible. Unlike similar applications, summarizations can be based not only on sentences but also on elements of other categories, such as paragraphs and sections. The main emphasis was put on ease of use, platform independence and multilingual support. The application provides a flexible environment that can be customized to the user's needs.
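A rough sketch of extractive summarization where the text unit is a parameter, echoing the idea of summarizing over sentences, paragraphs or sections; the frequency-based scoring is a simple heuristic, not the application's actual algorithm:

```python
import re
from collections import Counter

def summarize(text, n_units=2, unit_pattern=r"(?<=[.!?])\s+"):
    r"""Frequency-based extractive summarization. The unit boundary is a
    parameter, so paragraphs (e.g. splitting on blank lines with
    r"\n\s*\n") can be scored the same way as sentences."""
    units = [u for u in re.split(unit_pattern, text.strip()) if u]
    words = Counter(w.lower() for w in re.findall(r"\w+", text))
    def score(u):
        toks = re.findall(r"\w+", u.lower())
        return sum(words[t] for t in toks) / (len(toks) or 1)
    best = sorted(sorted(units, key=score, reverse=True)[:n_units],
                  key=units.index)  # keep original document order
    return " ".join(best)

text = ("Text indexing enables fast search. Search needs an index. "
        "Cats sleep a lot. An index maps words to documents.")
print(summarize(text, n_units=2))
# Search needs an index. An index maps words to documents.
```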
APA, Harvard, Vancouver, ISO, and other styles
48

Fishbein, Jonathan Michael. "Integrating Structure and Meaning: Using Holographic Reduced Representations to Improve Automatic Text Classification." Thesis, 2008. http://hdl.handle.net/10012/3819.

Full text
Abstract:
Current representation schemes for automatic text classification treat documents as syntactically unstructured collections of words (Bag-of-Words) or "concepts" (Bag-of-Concepts). Past attempts to encode syntactic structure have treated part-of-speech information as another word-like feature, but have been shown to be less effective than non-structural approaches. We propose a new representation scheme using Holographic Reduced Representations (HRRs) as a technique to encode both semantic and syntactic structure, though in very different ways. This method is unique in the literature in that it encodes the structure across all features of the document vector while preserving text semantics. Our method does not increase the dimensionality of the document vectors, allowing for efficient computation and storage. We present the results of various Support Vector Machine classification experiments that demonstrate the superiority of this method over Bag-of-Concepts representations and improvement over Bag-of-Words in certain classification contexts.
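A brief numeric sketch of the HRR operations the scheme builds on: binding by circular convolution and approximate unbinding by correlation; the vector size and role/filler names are illustrative assumptions:

```python
import numpy as np

def bind(a, b):
    """Circular convolution: the HRR binding operator."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def unbind(c, a):
    """Approximate inverse: correlate with the involution of `a`."""
    a_inv = np.concatenate(([a[0]], a[:0:-1]))
    return bind(c, a_inv)

rng = np.random.default_rng(0)
n = 512
noun, verb, dog, chase = (rng.standard_normal(n) / np.sqrt(n)
                          for _ in range(4))

# Encode a role/filler structure into a single fixed-width vector.
sentence = bind(noun, dog) + bind(verb, chase)

# Decoding the noun role recovers something close to `dog`.
decoded = unbind(sentence, noun)
sims = {name: float(v @ decoded) for name, v in
        [("dog", dog), ("chase", chase)]}
print(sims)  # similarity with `dog` should be clearly the largest
```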
APA, Harvard, Vancouver, ISO, and other styles
49

Cerdeirinha, João Manuel Macedo. "Recuperação de imagens digitais com base no conteúdo: estudo na Biblioteca de Arte e Arquivos da Fundação Calouste Gulbenkian." Master's thesis, 2019. http://hdl.handle.net/10362/91474.

Full text
Abstract:
The massive growth of multimedia data on the Internet and the emergence of new sharing platforms created major challenges for information retrieval. The limitations of text-based searches for this type of content have led to the development of a content-based information retrieval approach that has received increasing attention in recent decades. Taking into account the research carried out in this area, and digital images being the focus of this research, concepts and techniques associated with this approach are explored through a theoretical survey that reports the evolution of information retrieval and the importance that this subject has for Information Management and Curation. In the context of the systems that have been developed using automatic indexing, the various applications of this type of process are indicated. Available CBIR tools are also identified for a case study of the application of this type of image retrieval in the context of the Art Library and Archives of the Calouste Gulbenkian Foundation and the photographic collections that it holds in its resources, considering the particularities of the institution to which they belong. For the intended demonstration and according to the established criteria, online CBIR tools were initially used and, in the following phase, locally installed software was selected to search and retrieve in a specific collection. Through this case study, the strengths and weaknesses of content-based image retrieval are attested against the more traditional approach based on textual metadata currently in use in these collections. Taking into consideration the needs of users of the systems in which these digital objects are indexed, combining these techniques may lead to more satisfactory results.
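A minimal sketch of the kind of low-level matching a CBIR system performs, using global color histograms and histogram intersection; real CBIR tools use much richer features, and everything here is a toy assumption:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize an RGB image (H x W x 3, uint8) into a joint color histogram."""
    q = (image // (256 // bins)).reshape(-1, 3)
    idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Classic CBIR similarity: 1.0 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())

# Toy query: rank two random "images" against a query image.
rng = np.random.default_rng(1)
imgs = {name: rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)
        for name in ("query", "a", "b")}
hq = color_histogram(imgs["query"])
ranked = sorted((histogram_intersection(hq, color_histogram(imgs[n])), n)
                for n in ("a", "b"))
print(ranked[::-1])  # most similar first
```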
APA, Harvard, Vancouver, ISO, and other styles
50

Seo, Eun-Gyoung. "An experiment in automatic indexing with Korean texts a comparison of syntactico-statistical and manual methods /." 1993. http://books.google.com/books?id=jTlkAAAAMAAJ.

Full text
APA, Harvard, Vancouver, ISO, and other styles
