Dissertations / Theses on the topic 'Latent Semantic Indexing (LSI)'

Consult the top 39 dissertations / theses for your research on the topic 'Latent Semantic Indexing (LSI).'

1

Zhu, Weizhong Allen Robert B. "Text clustering and active learning using a LSI subspace signature model and query expansion /." Philadelphia, Pa. : Drexel University, 2009. http://hdl.handle.net/1860/3077.

2

La Fleur, Magnus, and Fredrik Renström. "Conceptual Indexing using Latent Semantic Indexing: A Case Study." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-263029.

Abstract:
Information Retrieval is concerned with locating information (usually text) that is relevant to a user's information need. Retrieval systems based on word matching suffer from the vocabulary mismatch problem, a common phenomenon in the use of natural languages. This difficulty is especially severe in large, full-text databases, since such databases contain many different expressions of the same concept. One method aimed at reducing the negative effects of the vocabulary mismatch problem is for the retrieval system to exploit statistical relations. This report examines the utility of conceptual indexing for improving the retrieval performance of a domain-specific Information Retrieval system using Latent Semantic Indexing (LSI). Techniques like LSI attempt to exploit and model global usage patterns of terms, so that related documents that share no common (literal) terms are still represented by nearby conceptual descriptors. Experimental results show that the method is noticeably more effective than the baseline for relatively complete queries. However, the current implementation did not improve the effectiveness of short, yet descriptive, queries.
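The idea that related documents sharing no literal terms can still receive nearby conceptual descriptors is easy to see with a truncated SVD on a toy term-document matrix (a sketch of standard LSI, not this thesis's implementation; the vocabulary and counts are invented):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# d0 mentions only "car" and d2 only "automobile", so they share no
# literal terms; the bridge document d1 mentions both, and LSI uses
# that co-occurrence to place d0 and d2 close together.
#              d0  d1  d2  d3
A = np.array([[1., 1., 0., 0.],   # car
              [0., 1., 1., 0.],   # automobile
              [0., 1., 0., 0.],   # engine
              [0., 0., 0., 1.],   # flower
              [0., 0., 0., 1.]])  # petal

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # dimension of the latent space
docs = (np.diag(s[:k]) @ Vt[:k]).T    # document coordinates in that space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_raw = cosine(A[:, 0], A[:, 2])      # 0.0: no shared terms
sim_latent = cosine(docs[0], docs[2])   # close to 1.0 in the latent space
```

In the raw vector space d0 and d2 are orthogonal; in the rank-2 latent space their co-occurrence with d1 makes them nearly identical.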
3

Suwannajan, Pakinee. "Evaluating the performance of latent semantic indexing." Diss., Connect to online resource, 2005. http://wwwlib.umi.com/dissertations/fullcit/3178359.

4

Araújo, Hugo Rafael Teixeira Soares. "Exploring biomedical literature using latent semantic indexing." Master's thesis, Universidade de Aveiro, 2012. http://hdl.handle.net/10773/11298.

Abstract:
Master's degree in Computer and Telematics Engineering
The rapid increase in the amount of data available on the Internet, and the fact that it is mostly in the form of unstructured text, has brought successive challenges in information indexing and retrieval. Besides the Internet, specialized literature databases also face these problems. With the amount of information growing so rapidly, traditional methods for indexing and retrieving information become insufficient for the increasingly stringent requirements of users. These issues lead to the need to improve information retrieval systems using more powerful and efficient techniques. One of those methods is Latent Semantic Indexing (LSI), which has been suggested as a good solution for modeling and analyzing unstructured text. LSI uncovers the semantic structure of a corpus by finding the relations between documents and terms. It is a robust solution for improving information retrieval systems, especially in the identification of relevant documents for a user's query. Besides this, LSI can be useful in other tasks such as document indexing and term annotation. The main goal of this project was to study and explore the LSI process for term annotation and for structuring the documents retrieved by an information retrieval system. The performance results of these algorithms are presented and, in addition, several new ways of visualizing these results are proposed.
5

Geiß, Johanna. "Latent Semantic Indexing and Information Retrieval: a quest with BosSE." [S.l. : s.n.], 2006. http://nbn-resolving.de/urn:nbn:de:bsz:16-opus-67536.

6

Geiss, Johanna. "Latent semantic sentence clustering for multi-document summarization." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609761.

7

Buys, Stephanus. "Log analysis aided by latent semantic mapping." Thesis, Rhodes University, 2013. http://hdl.handle.net/10962/d1002963.

Abstract:
In an age of zero-day exploits and increased on-line attacks on computing infrastructure, operational security practitioners are becoming increasingly aware of the value of the information captured in log events. Analysis of these events is critical during incident response, forensic investigations related to network breaches, hacking attacks and data leaks. Such analysis has led to the discipline of Security Event Analysis, also known as Log Analysis. There are several challenges when dealing with events, foremost being the increased volumes at which events are often generated and stored. Furthermore, events are often captured as unstructured data, with very little consistency in the formats or contents of the events. In this environment, security analysts and implementers of Log Management (LM) or Security Information and Event Management (SIEM) systems face the daunting task of identifying, classifying and disambiguating massive volumes of events in order for security analysis and automation to proceed. Latent Semantic Mapping (LSM) is a proven paradigm shown to be an effective method of, among other things, enabling word clustering, document clustering, topic clustering and semantic inference. This research is an investigation into the practical application of LSM in the discipline of Security Event Analysis, showing the value of using LSM to assist practitioners in identifying types of events, classifying events as belonging to certain sources or technologies and disambiguating different events from each other. The culmination of this research presents adaptations to traditional natural language processing techniques that resulted in improved efficacy of LSM when dealing with Security Event Analysis. This research provides strong evidence supporting the wider adoption and use of LSM, as well as further investigation into Security Event Analysis assisted by LSM and other natural language or computer-learning processing techniques.
8

Polyakov, Serhiy. "Enhancing User Search Experience in Digital Libraries with Rotated Latent Semantic Indexing." Thesis, University of North Texas, 2015. https://digital.library.unt.edu/ark:/67531/metadc804881/.

Abstract:
This study investigates a semi-automatic method for the creation of topical labels representing the topical concepts in information objects. The method is called rotated latent semantic indexing (rLSI). rLSI has found application in text mining but has not been used for topical label generation in digital libraries (DLs). The present study proposes a theoretical model and an evaluation framework based on the LSA theory of meaning and investigates rLSI in a DL environment. The proposed evaluation framework for rLSI topical labels focuses on human information-search behavior and satisfaction measures. Experimental systems utilizing those topical labels were built for the purpose of evaluating user satisfaction with the search process. A new instrument was developed for this study, and the experiment showed high reliability of the measurement scales and confirmed the construct validity. Data was collected through information search tasks performed by 122 participants using two experimental systems. A quantitative method of analysis, partial least squares structural equation modeling (PLS-SEM), was used to test a set of research hypotheses and to answer the research questions. The results showed a non-significant, indirect effect of topical label type on both guidance and satisfaction. The conclusion of the study is that topical labels generated using rLSI provide the same levels of alignment, guidance, and satisfaction with the search process as topical labels created by professional indexers using best practices.
9

Spomer, Judith E. "Latent semantic analysis and classification modeling in applications for social movement theory /." 2008. http://eprints.ccsu.edu/archive/00000552/02/1996FT.htm.

Abstract:
Thesis (M.S.) -- Central Connecticut State University, 2008.
Thesis advisor: Roger Bilisoly. "... in partial fulfillment of the requirements for the degree of Master of Science in Data Mining." Includes bibliographical references (leaves 122-127). Also available via the World Wide Web.
10

Hockey, Andrew. "Computational modelling of the language production system : semantic memory, conflict monitoring, and cognitive control processes /." [St. Lucia, Qld.], 2006. http://www.library.uq.edu.au/pdfserve.php?image=thesisabs/absthe20099.pdf.

11

Alsallal, M. "A machine learning approach for plagiarism detection." Thesis, Coventry University, 2016. http://curve.coventry.ac.uk/open/items/7e903a56-4845-4852-b1a8-2849b1cdb08a/1.

Abstract:
Plagiarism detection is gaining increasing importance due to requirements for integrity in education. Existing research has investigated the problem of plagiarism detection with varying degrees of success. The literature reveals two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis develops two novel approaches to address both of these methods. Firstly, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), stylometry and Support Vector Machines (SVM). The LSA application was fine-tuned to take in stylometric features (most common words) in order to characterise document authorship, as described in chapter 4. The results revealed that LSA-based stylometry outperformed the traditional LSA application. Support vector machine based algorithms were used to perform the classification procedure in order to predict which author had written a particular book being tested. The proposed method successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases. Secondly, the intrinsic detection method relied on the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The intrinsic method aims to generate a model of author "style" by revealing a set of characteristic features of authorship.
The model-generation procedure focuses on just one author, as an attempt to summarise aspects of an author's style in a definitive and clear-cut manner. The thesis also proposes a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) dataset, but divides it into training and test datasets in a novel manner. Both approaches have been evaluated using the well-known leave-one-out cross-validation method. Results indicate that by integrating deep analysis (LSA) and stylometric analysis, hidden changes can be identified whether or not a reference collection exists.
12

Zaras, Dimitrios. "Evaluating Semantic Internalization Among Users of an Online Review Platform." Thesis, University of North Texas, 2015. https://digital.library.unt.edu/ark:/67531/metadc804823/.

Abstract:
The present study draws on recent sociological literature that argues that the study of cognition and culture can benefit from theories of embodied cognition. The concept of semantic internalization is introduced, which is conceptualized as the ability to perceive and articulate the topics that are of most concern to a community as they are manifested in social discourse. Semantic internalization is partly an application of emotional intelligence in the context of community-level discourse. Semantic internalization is measured through the application of Latent Semantic Analysis. Furthermore, it is investigated whether this ability is related to an individual’s social capital and habitus. The analysis is based on data collected from the online review platform yelp.com.
13

Alazzam, Iyad. "Using Information Retrieval to Improve Integration Testing." Diss., North Dakota State University, 2012. https://hdl.handle.net/10365/26508.

Abstract:
Software testing is an important part of the software development process, and integration testing is an important and expensive level of it. Unfortunately, developers have limited time for integration testing and debugging, and integration testing becomes very hard as the combinations grow in size and the chains of calls from one module to another grow in number, length, and complexity. This research provides a new methodology for integration testing that reduces the number of test cases needed to a significant degree while retaining as much effectiveness as possible. The proposed approach determines the best order in which to integrate the classes currently available for integration, as well as the external method calls that should be tested, and in what order, for maximum effectiveness. Our approach limits the number of integration test cases, which depends mainly on the dependency among modules and on the number of integrated classes in the application. The dependency among modules is determined using an information retrieval technique called Latent Semantic Indexing (LSI). In addition, this research extends mutation testing for use in integration testing as a method to evaluate the effectiveness of the integration testing process. We have developed a set of integration mutation operators to support the development of integration mutation testing. We have conducted experiments based on ten Java applications. To evaluate the proposed methodology, we created mutants using new mutation operators that exercise integration testing. Our experiments show that the test cases killed more than 60% of the created mutants.
14

Chen, Xin. "Human-centered semantic retrieval in multimedia databases." Birmingham, Ala. : University of Alabama at Birmingham, 2008. https://www.mhsl.uab.edu/dt/2008p/chen.pdf.

Abstract:
Thesis (Ph. D.)--University of Alabama at Birmingham, 2008.
Additional advisors: Barrett R. Bryant, Yuhua Song, Alan Sprague, Robert W. Thacker. Description based on contents viewed Oct. 8, 2008; title from PDF t.p. Includes bibliographical references (p. 172-183).
15

Langley, Joseph R. "SCRIBE a clustering approach to semantic information retrieval /." Master's thesis, Mississippi State : Mississippi State University, 2006. http://sun.library.msstate.edu/ETD-db/ETD-browse/browse.

16

Meqdadi, Omar Mohammed. "UNDERSTANDING AND IDENTIFYING LARGE-SCALE ADAPTIVE CHANGES FROM VERSION HISTORIES." Kent State University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=kent1374791564.

17

Novák, Ján. "Automatická tvorba tezauru z wikipedie." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-236964.

Abstract:
This thesis deals with the automatic acquisition of thesauri from Wikipedia. It describes Wikipedia as a suitable data set for thesaurus acquisition and surveys methods for computing the semantic similarity of terms. The thesis also describes the design and implementation of a system for automatic thesaurus acquisition. Finally, the implemented system is evaluated with standard metrics such as precision and recall.
18

Vasireddy, Jhansi Lakshmi. "Applications of Linear Algebra to Information Retrieval." Digital Archive @ GSU, 2009. http://digitalarchive.gsu.edu/math_theses/71.

Abstract:
Some of the theory of nonnegative matrices is first presented. The Perron-Frobenius theorem is highlighted. Some of the important linear-algebraic methods of information retrieval are surveyed. Latent Semantic Indexing (LSI), which uses the singular value decomposition, is discussed. The Hyperlink-Induced Topic Search (HITS) algorithm is next considered; here the power method for finding dominant eigenvectors is employed. Through the use of a theorem by Sinkhorn and Knopp, a modified HITS method is developed. Lastly, the PageRank algorithm is discussed. Numerical examples and MATLAB programs are also provided.
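The power method mentioned for HITS amounts to repeatedly applying the adjacency matrix and renormalizing; a minimal sketch on an invented four-page link graph (illustrative only, not taken from the thesis):

```python
import numpy as np

# Adjacency matrix: L[i, j] = 1 if page i links to page j.
# Pages 0 and 1 both link to page 2; page 2 links to page 3.
L = np.array([[0., 0., 1., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.],
              [0., 0., 0., 0.]])

# HITS: authority scores form the dominant eigenvector of L^T L,
# hub scores that of L L^T; power iteration finds both together.
a = np.ones(4)
for _ in range(100):
    h = L @ a                  # good hubs point at good authorities
    a = L.T @ h                # good authorities are pointed at by good hubs
    a /= np.linalg.norm(a)     # renormalize to keep the iteration stable
h = L @ a
h /= np.linalg.norm(h)

top_authority = int(np.argmax(a))   # page 2, the most linked-to page
```

Pages 0 and 1 come out as the top hubs with equal scores, since each links to the top authority.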
19

Hájek, Petr. "Možnosti využití netradičních kvantitativních metod při předpovídání finančních krizí." Doctoral thesis, Vysoká škola ekonomická v Praze, 2007. http://www.nusl.cz/ntk/nusl-2340.

Abstract:
The thesis is divided into three parts. The theoretical part surveys major crises of the past several hundred years, a typology of crises, financial market failures according to P. Krugman, the generational models, price bubbles, the link between capital flows and the debt problem, contagion, crisis prevention, and crisis management. The second part, describing the current state of research on financial crisis prediction, cites dozens of studies and compares their results; attention is also paid to the definition of a financial crisis. The third part applies Latent Semantic Indexing (LSI) to the task of predicting financial crises. The tested hypothesis is that stock markets can, within one quarter (64 stock-market observations), reflect future developments in monetary policy (over the following 128 observations). This hypothesis was confirmed in the dissertation on a sample of 39 countries over the period 1985-2007, interpreting the development of interest rates and of the exchange rate of the domestic currency against the USD. Although the LSI method and its studied stock-market application managed to locate several crises to the exact day, it is better suited to identifying and analyzing the fragile periods in which a crisis may occur than to predicting crises directly.
20

Macedo, Alessandra Alaniz. "Especificação, instanciação e experimentação de um arcabouço para criação automática de ligações hipertexto entre informações homogêneas." Universidade de São Paulo, 2004. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-05102004-113421/.

Abstract:
With the evolution of the Internet, distinct communication media have focused on the Web as a channel for publishing information. An immediate consequence is an abundance of sources of information and writing styles on the Web. This effect, combined with the inherent curiosity of human beings, has led Web users to look for more than a single article about the same subject. To gain access to separate accounts of the same subject, readers need to search, read and analyze information provided by different sources. Besides consuming a great amount of time, this activity imposes a cognitive overhead on users. Hypermedia research has investigated mechanisms for supporting users in the process of identifying information in homogeneous repositories, whether available on the Web or not. In this thesis, homogeneous repositories are those containing information that describes the same subject. This thesis investigates the specification and construction of a framework intended to support the automatic creation of hypertext links between homogeneous repositories. The proposed framework, called CARe (Automatic Creation of Relationships), is composed of a set of classes, methods and relationships that gather the information to be related and process it to generate indexes. Those indexes are then related and used in the automatic creation of hypertext links among distinct excerpts of the original information. The framework was defined after a phase of domain analysis in which requirements were identified and software components were built. In that same phase, several prototypes were also developed iteratively.
21

Zougris, Konstantinos. "Sociological Applications of Topic Extraction Techniques: Two Case Studies." Thesis, University of North Texas, 2015. https://digital.library.unt.edu/ark:/67531/metadc804982/.

Abstract:
Limited research has been conducted on the applicability of topic extraction techniques in Sociology. Addressing the modern methodological opportunities, and responding to the skepticism regarding the absence of theoretical foundations supporting the use of text analytics, I argue that Latent Semantic Analysis (LSA), complemented by other text analysis techniques and multivariate techniques, can constitute a unique hybrid method that facilitates the sociological interpretation of web-based textual data. To illustrate the applicability of the hybrid technique, I developed two case studies. My first case study is associated with the Sociology of media. It focuses on topic extraction and sentiment polarization among partisan texts posted on two major news sites. I find evidence of highly polarized opinions in comments posted on the Huffington Post and the Daily Caller. The most polarizing topic was associated with a commentator's reference to hoodies in the context of the Trayvon Martin incident. My findings support contemporary research suggesting that media pundits frequently use tactics of outrage to provoke polarization of public opinion. My second case study contributes to the research domain of the Sociology of knowledge. The hybrid method revealed evidence of topical divides and topical "bridges" in the intellectual landscape of the British and American sociological journals. My findings confirm the theoretical assertions describing Sociology as a fractured field, and partially support the existence of more globalized topics in the discipline.
22

Alhindawi, Nouh Talal. "Supporting Source Code Comprehension During Software Evolution and Maintenance." Kent State University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=kent1374790792.

23

Pohlídal, Antonín. "Inteligentní emailová schránka." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236458.

Abstract:
This master's thesis deals with the use of text classification for sorting incoming emails. First, it describes Knowledge Discovery in Databases and analyzes text classification and selected methods in detail. It then describes email communication and the SMTP, POP3 and IMAP protocols. The next part contains the design of a system that classifies incoming emails and describes the related technologies, i.e., Apache James Server, PostgreSQL and RapidMiner. Further, the implementation of all necessary components is described. The last part contains experiments with the email server using the Enron dataset.
24

Kontostathis, April. "A term co-occurrence based framework for understanding LSI [i.e. latent semantic indexing] : theory and practice /." Diss., 2003. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3117161.

25

Shiung, Ruei-shiang, and 熊瑞祥. "On Latent semantic Indexing." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/42703613124815531926.

Abstract:
Master's thesis
National Chung Cheng University
Institute of Applied Mathematics
ROC year 94
In this article we study the well-known information retrieval (IR) system: the Latent Semantic Indexing (LSI) model. This method projects document vectors into a specific subspace of the range of the term-document matrix. We provide two new methods, called the QR method and the Gram-Schmidt method, for projecting document vectors into different subspaces, and compare them with the LSI model. Furthermore, we present numerical experiments using the LSI model, the vector space model, and the QR method on the Medline and Cranfield collections.
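The flavour of such subspace projection can be sketched as follows (an assumed reading of the setup, not the thesis's exact construction; the toy matrix, rank, and helper names are invented). A QR-style method projects onto an orthonormal basis built directly from document vectors, whereas LSI would project onto the top singular directions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # toy 6-term x 4-document matrix

# QR-style method: an orthonormal basis for (part of) the document space.
Q, _ = np.linalg.qr(A)            # columns of Q span the column space of A
k = 3
Qk = Q[:, :k]                     # basis of the projection subspace

def match(query):
    q = Qk @ (Qk.T @ query)       # project the query onto the subspace
    P = Qk @ (Qk.T @ A)           # project all document vectors likewise
    sims = (P.T @ q) / (np.linalg.norm(P, axis=0) * np.linalg.norm(q))
    return int(np.argmax(sims))   # index of the best-matching document

best = match(A[:, 2])             # querying with document 2 itself
```

An LSI-style variant would use `U[:, :k]` from `np.linalg.svd(A)` in place of `Qk`; either way, matching reduces to cosine similarity inside a k-dimensional subspace.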
26

Zhang, Xueshan. "Novelty Detection by Latent Semantic Indexing." Thesis, 2013. http://hdl.handle.net/10012/7560.

Abstract:
As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. By refining raw search results, filtering out old news and saving only the novel messages, it spares modern people the nightmare of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database that can be viewed as a thesaurus. Rather than borrowing a dictionary, we propose a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationships automatically with the help of language resources. To apply LSI, which involves matrix factorization, an immediate problem is that the dataset in novelty detection is dynamic and changes constantly. In imitation of the real-world scenario, texts are ranked in chronological order and examined one by one. Each text is compared only with those that appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report and has a new row added at the bottom every time we read a new document. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method for capturing semantic structure, it has never been used in novelty detection, nor have other statistical treatments. We tried to change this situation by introducing an external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided by year and type in order to take the differences between them into account.
Results showed that LSI, though very effective in traditional information retrieval tasks, brought only a slight improvement in performance for some data types. The extent of improvement depended on the similarity between the news data and the external information. A probe into the co-occurrence matrix attributed this limited performance to the unique features of microblogs: their short sentences and restricted dictionary make it very hard to recover and exploit latent semantic information via traditional data structures.
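Projecting incoming news vectors onto a space built from an external corpus is LSI's standard "fold-in" step: a new document vector d gets latent coordinates Sigma_k^{-1} U_k^T d without refactoring the growing matrix. A minimal sketch (the background matrix, threshold, and helper names are invented for illustration, not taken from the thesis):

```python
import numpy as np

# Latent space built once from a background corpus (4 terms x 3 documents).
background = np.array([[2., 0., 1.],
                       [1., 1., 0.],
                       [0., 2., 1.],
                       [0., 1., 2.]])
U, s, Vt = np.linalg.svd(background, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

def fold_in(doc):
    """Project a new document into the latent space without re-running SVD."""
    return (Uk.T @ doc) / sk

def is_novel(doc, seen, threshold=0.8):
    """A document is novel if it closely resembles no earlier document."""
    v = fold_in(doc)
    for w in seen:                       # compare with earlier documents only
        cos = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
        if cos >= threshold:
            return False                 # too similar to an old message
    return True

seen = [fold_in(np.array([2., 1., 0., 0.]))]
novel = is_novel(np.array([2., 1., 0., 0.]), seen)   # a repeat: not novel
```

The chronological constraint is respected by only ever comparing against the `seen` list, which grows as documents stream in.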
APA, Harvard, Vancouver, ISO, and other styles
27

Lin, Chia-min, and 林家民. "Clustering Multilingual Documents: A Latent Semantic Indexing Based Approach." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/69494075344886983368.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Graduate Institute of Information Management
94
Document clustering automatically organizes a document collection into distinct groups of similar documents on the basis of their contents. Most existing document clustering techniques deal with monolingual documents (i.e., documents written in one language). However, with the trend of globalization and advances in Internet technology, an organization or individual often generates/acquires and subsequently archives documents in different languages, creating the need for multilingual document clustering (MLDC). Motivated by its significance and need, this study designs a Latent Semantic Indexing (LSI) based MLDC technique. Our empirical evaluation results show that the proposed LSI-based multilingual document clustering technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision.
APA, Harvard, Vancouver, ISO, and other styles
28

Zhuang, Ke-Ren, and 莊可任. "Automatic Presentation Slide Generation based on Latent Semantic Indexing." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/97034104190672499230.

Full text
Abstract:
Master's thesis
National Pingtung Institute of Commerce
Department and Graduate Institute of Information Management
100
We proposed using Latent Semantic Analysis (LSA) to generate document summaries and further form PowerPoint slides to help researchers organize their briefings. Compared with automatic summarization, automatic slide generation additionally requires aligning the slides' contents with their original chapters and covering the important issues of the document. To fulfill these two requirements, we proposed summarizing each chapter/section and using LSA for topic extraction to cover the document's issues. In addition, since the purpose of document summarization is to extract sentences rather than terms, we suggested using a sentence-by-paragraph matrix in place of the original term-by-sentence matrix. We evaluated the following parameters: TF (term frequency) or √TF in the frequency matrix, and sentence restructuring (fixing incorrect sentence segmentation and removing hyphens at line ends) or no restructuring. The compared summarizers include NTU and LSA, where NTU is a non-topic-extraction method and LSA is a topic-extraction method. We first compared section-oriented summarization against whole-document summarization. The results showed that both NTU and LSA perform better (higher F-score) on section-oriented summarization, verifying our first idea that section-oriented summarization is more suitable for slide generation. We next compared TF and √TF. The results showed that NTU performs slightly (but not significantly) better with √TF, while LSA performs worse with √TF; TF is therefore good enough, and it is faster to compute. Thirdly, we examined the effect of sentence restructuring. Both NTU and LSA improved when sentence restructuring was applied, which coincides with the principle of garbage in, garbage out: when the input sentences are a mess, the output will also be incorrect.
Finally, we compared NTU, LSA-ts (term-by-sentence), and LSA-sp (sentence-by-paragraph) using the whole dataset. The results showed that our proposed method, LSA-sp, performs best, far better than the other two, demonstrating the validity of our proposed method.
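The topic-based sentence selection this abstract relies on can be illustrated with a small sketch. This is a hedged, Gong-and-Liu-style approximation in NumPy: for each leading latent topic, pick the sentence weighted most heavily on it. The matrix values are invented, and the thesis's sentence-by-paragraph variant is not reproduced here.

```python
import numpy as np

# Hedged sketch: select summary sentences from a term-by-sentence
# matrix M by taking, for each leading latent "topic" (right-singular
# vector), the sentence with the largest weight on that topic.
def lsa_select(M, n_topics):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    chosen = []
    for topic in Vt[:n_topics]:           # one row per latent topic
        idx = int(np.argmax(np.abs(topic)))
        if idx not in chosen:             # avoid duplicate sentences
            chosen.append(idx)
    return sorted(chosen)                 # keep document order

M = np.array([[2., 0., 0., 1.],
              [1., 0., 1., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])          # invented term-by-sentence counts
picked = lsa_select(M, n_topics=2)
```

Selecting one sentence per singular vector is what lets the summary span distinct topics rather than repeating the single dominant theme, which is the "covering important issues" requirement named above.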
APA, Harvard, Vancouver, ISO, and other styles
29

Wang, Juo-Wen, and 汪若文. "Automatic Classification of Text Documents by Using Latent Semantic Indexing." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/09421240911724157604.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Information Management Group, Executive Master's Program, College of Management
92
Search and browsing are both important tasks in information retrieval. Search provides a way to find information rapidly, but relying on words makes it hard to deal with synonymy and polysemy. Besides, users sometimes cannot formulate suitable queries and so cannot find the information they really need. To provide good information services, browsing through a good classification mechanism is as important as search. There are two steps in classifying documents: the first is to represent documents in a suitable mathematical form; the second is to classify them automatically with suitable classification algorithms. Classification is a task of conceptualization, and representing documents in the conventional vector space model cannot avoid relying on words explicitly. Latent semantic indexing (LSI) was developed to find the semantic concepts of documents, which may suit document classification. This thesis studies the feasibility and effect of classifying text documents using LSI as the document representation, with both centroid-vector and k-NN classification algorithms, and compares the results to those of the vector space model. This study deals with one-category classification. The results show that automatic classification of text documents using LSI with suitable classification algorithms is feasible, but its accuracy is not as good as that of the vector space model. The effect of applying LSI to multi-category classification and of combining LSI with other classification algorithms needs further study.
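A centroid-vector classifier of the kind this abstract compares against k-NN can be sketched in a few lines. This is a hedged illustration: the 2-d vectors and class labels below merely stand in for LSI-reduced document representations and are invented for demonstration.

```python
import numpy as np

# Hedged sketch of centroid-vector classification: assign a document
# to the class whose centroid has the highest cosine similarity.
# Vectors stand in for LSI-reduced document representations.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def centroid_classify(doc, centroids):
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))

centroids = {
    "finance": np.array([0.9, 0.1]),   # mean of a class's doc vectors
    "sports":  np.array([0.1, 0.9]),
}
label = centroid_classify(np.array([0.8, 0.3]), centroids)
```

The same cosine comparison generalizes to k-NN by scoring individual training documents instead of class centroids and voting over the k nearest.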
APA, Harvard, Vancouver, ISO, and other styles
30

Zeng, Wei-Rong, and 曾韋榮. "Combining Latent Semantic Indexing with Information granulation for Data Mining." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/pr286u.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
Graduate Institute of Commerce Automation and Management
94
With rapid information growth, data mining has developed to discover useful patterns in huge amounts of data. Enterprise data usually feature high dimensionality, sparsity, and imbalance, which significantly impact data mining. Data preprocessing has therefore become an essential task, as it can reduce the data size and remove noise and outliers. Using Singular Value Decomposition, Latent Semantic Indexing (LSI) can effectively process high-dimensional, sparse data; such data can be preprocessed with LSI to reduce both dimensions and records. For imbalanced data, Information Granulation (IG) can transform majority-class data sharing similar properties into information granules, raising the ratio of the minority class and resolving the imbalance problem. LSI and IG can therefore serve as the first stage of data preprocessing in the data mining process. This thesis combines LSI with IG for data mining in order to reduce the size and dimensionality of data and resolve the problems caused by imbalanced data. The results show that applying LSI alone effectively reduces the dimensions of the data, LSI+IG effectively reduces both the dimensions and the size, and IG+LSI effectively reduces the sub-attributes (generated in the IG process) and the size of the data. All three reduction methods also reduce the computational time of analysis. For imbalanced data, the computational results indicate that LSI alone is not suitable for preprocessing; with LSI+IG or IG+LSI, the accuracy on the minority class is improved.
This thesis concludes that classification results improve the most when IG+LSI is adopted.
APA, Harvard, Vancouver, ISO, and other styles
31

Lin, Guan-Hong, and 林冠宏. "Protein Function Prediction from Protein Interaction Networks by Latent Semantic Indexing." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/77778843235131898005.

Full text
Abstract:
Master's thesis
National Central University
Graduate Institute of Computer Science and Information Engineering
93
Determining protein function is one of the most important tasks in the post-genomic era. Large-scale biological experimental results such as protein interaction networks can now be obtained, and these data often carry information about protein functions. In this thesis, we present an approach based on Latent Semantic Indexing (LSI) to extract this information from protein interaction networks. LSI is an information retrieval technique that can address the synonymy and polysemy problems. Because biologists believe there are many false positives and false negatives in protein interaction networks, we use the properties of LSI to filter out the wrong and confounding information retrieved from these networks. Our results show that the approach can find functionally related proteins in cells.
APA, Harvard, Vancouver, ISO, and other styles
32

Chang, Jyh-Cheng, and 張志成. "Computer-Assisted Construction of Knowledge Map Based on Latent Semantic Indexing." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/69111521837321476885.

Full text
Abstract:
Doctoral dissertation
Chung Yuan Christian University
Graduate Institute of Electronic Engineering
95
The most important issue in constructing a concept map is not coming up with the list of concepts to include, but linking the concepts into meaningful propositions to create a connected structure that reflects the person's understanding of a domain. This research presents a system which, during the process of concept mapping, takes the partially constructed map or an independent keyword as input to mine the Web, and suggests to the user a list of weighted concepts relevant to the map under construction. The system first uses the Latent Semantic Indexing (LSI) algorithm to analyze Web content and transform it into a set of relevant terms. Next, the relevant terms are passed through a sigmoid function to produce a list of weighted concepts. The system also introduces a knowledge structure, the Knowledge Map, which extends the concept map with a hierarchical structure on the computer to store learners' concepts; the system with the knowledge map is called KMap. KMap can also record learners' learning processes and replay the construction of a knowledge map. At the end of this study, a prototype system is implemented and used to demonstrate the suggestion process, using the popular Web resource Wikipedia as the content for analysis. After analyzing the content, the system offers three learning models, Free-style, Guided, and Expert, to help learners construct their knowledge maps.
APA, Harvard, Vancouver, ISO, and other styles
33

Majavu, Wabo. "Classification of web resident sensor resources using latent semantic indexing and ontologies." Thesis, 2010. http://hdl.handle.net/10539/7920.

Full text
Abstract:
Web resident sensor resource discovery plays a crucial role in the realisation of the Sensor Web. The vision of the Sensor Web is to create a web of sensors that can be manipulated and discovered in real time. A current research challenge in the Sensor Web is the discovery of relevant web sensor resources. The proposed approach towards solving the discovery problem is to implement a modified Latent Semantic Indexing (LSI) by making use of an ontology for classifying web resident resources found in geospatial web portals. This research introduces a new method aimed at improving an information retrieval algorithm, influencing the vector decomposition by including a formal representation of the knowledge of the domain of interest. The aim is to bias the retrieval to better classify the resources of interest. The proposed method uses the domain knowledge expressed in the ontology to improve knowledge extraction, using the concept definitions and relationships in the ontology to create semantic links between documents. The clusters formed using the modified algorithm are analysed, and performance is measured by evaluating the inter-cluster distances and similarity measures within each cluster. The distances are expressed as Euclidean distances between vectors in n-dimensional latent space. The research focus is on investigating how prior domain knowledge improves the clustering when k-means is used as the partitioning algorithm. It is observed that the modified extraction algorithm can isolate a group of documents that are used to populate the knowledge base, resulting in improved storage of the documents that occur in the geospatial portal. Results found using the combination of ontology and LSI show that clusters are better separated, and hierarchical clustering can form homogeneous clusters of more specific themes.
APA, Harvard, Vancouver, ISO, and other styles
34

Hsiu, Min, and 何旻修. "The Research of Using Agglomerative Fuzzy K-Means Clustering in Latent Semantic Indexing." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/35507679547671787463.

Full text
Abstract:
Master's thesis
National Central University
Graduate Institute of Information Management
98
Due to its high computational cost, latent semantic indexing has not been widely adopted, and fully computing it over large datasets remains too expensive, a concern raised by several scholars recently. Clustering-based strategies have been proposed that let users query keywords by comparing only against similar clusters, reducing computational cost. However, those strategies support only a fixed number of comparisons, and the query results are limited. This thesis applies the Agglomerative Fuzzy K-Means clustering algorithm to cluster large datasets. Each cluster is analyzed with singular value decomposition and low-rank approximation to map each document into a low-dimensional vector space. When keywords are queried, fuzzy clustering dynamically selects the similar clusters on which latent semantic indexing is carried out, and the documents relevant to the keywords are retrieved. The experimental results show that the method effectively improves retrieval quality compared with traditional methods that cluster in advance, and the dynamic, keyword-related cluster selection almost always chooses the best number of clusters. The F-measure reached 83% for single-keyword queries with recall as high as 85%, and the mean F-measure was 72% for two-keyword queries. Clustering with the Agglomerative Fuzzy K-Means algorithm filters out most web pages belonging to irrelevant clusters, reducing the huge computational cost of latent semantic indexing.
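The query path described above, routing a query to the most similar cluster and then ranking only that cluster's documents in the cluster's own low-rank space, can be sketched roughly as follows. This is a hedged simplification: crisp nearest-centroid routing stands in for the thesis's fuzzy memberships, and all vectors are invented for demonstration.

```python
import numpy as np

# Hedged sketch: route a query to the nearest cluster centroid, then
# rank only that cluster's documents in its low-rank LSI space.
# (Crisp centroid selection stands in for fuzzy memberships.)
def nearest_cluster(q, centroids):
    return int(np.argmin(np.linalg.norm(centroids - q, axis=1)))

def rank_in_cluster(q, docs, k):
    # Low-rank approximation of this cluster's doc-term matrix only.
    U, s, Vt = np.linalg.svd(docs, full_matrices=False)
    docs_k = U[:, :k] * s[:k]        # documents in k-dim latent space
    q_k = q @ Vt[:k].T               # project the query likewise
    sims = docs_k @ q_k / (
        np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims)         # best document first

centroids = np.array([[1., 0.], [0., 1.]])
cluster_docs = [np.array([[1., 0.], [0.9, 0.2], [0.5, 0.1]]),
                np.array([[0., 1.], [0.1, 0.8]])]
q = np.array([0.9, 0.1])
c = nearest_cluster(q, centroids)            # pick one cluster
order = rank_in_cluster(q, cluster_docs[c], k=1)
```

Because the SVD is computed per cluster rather than over the whole collection, the expensive factorization stays small, which is the cost saving the abstract claims.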
APA, Harvard, Vancouver, ISO, and other styles
35

Lin, Chun-Yu, and 林俊宇. "Language Identification of Language-Mixed Speech Using Latent Semantic Indexing and Language Model." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/5cyrat.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Department of Computer Science and Information Engineering (Master's and PhD Program)
90
With the trend toward globalized information exchange and communication, human-machine interfaces with multilingual processing ability, able to distinguish between languages and provide interconnected services, become increasingly important. In multilingual spoken language and dialog applications, the problem of multiple-language or mixed-language input is crucial for speech recognition. Recent research into automatic language identification (LID) and recognition has addressed the growing demand from the application side. These approaches emphasize determining the language in which a single utterance was spoken, and can be categorized, from a framework viewpoint, by whether they build language-dependent or language-independent recognizers, such as Gaussian mixture modeling, or single-language phone or parallel phone recognition followed by language modeling. In this thesis, a flexible and efficient front-end architecture for language identification is proposed for speech segmentation and detection with mixed LID in a single utterance. More specifically, this study focuses on: 1) adopting the Bayesian information criterion (BIC) with language-dependent acoustic features to divide the input utterance into several acoustically associated segments; 2) proposing a feature-discriminative, language-dependent GMM using a Latent Semantic Indexing approach to measure the strength of each language in each segment; 3) integrating a VQ-based bigram language model into a MAP-based decision mechanism for language identification; and 4) applying linear filtering and dynamic programming for precise language boundary estimation and smoothing. To evaluate the proposed approach, a 5304-utterance Mandarin-English mixed speech corpus (3 male speakers), 500 single-language utterances of 3~5 seconds (Database 1), and 250 single-language utterances of 15 seconds (Database 2) were collected.
80% of the corpus was used for training and 20% for testing. Experimental results showed that the proposed mixed-language decision mechanism achieved 74% accuracy, and the F value for language boundary detection was 0.62. The LID rates for Database 1 and Database 2 were 0.79 and 0.90, respectively. Our proposed architecture outperforms other well-established approaches. This study is aimed at multilingual speech recognition.
APA, Harvard, Vancouver, ISO, and other styles
36

Lukon, Shelly Candita. "A machine-aided approach to intelligent index generation using natural language processing and latent semantic analysis to determine the contexts and relationships among words in a corpus /." 2006. http://etd1.library.duq.edu/theses/available/etd-11022006-145614/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

CHEN, SHIH-HSUAN, and 陳世軒. "Integrating Latent Semantic Indexing and Clustering Algorithms to Develop a Long-Term Care 2.0 App based on Spark." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/wnw89d.

Full text
Abstract:
Master's thesis
National Formosa University
Master's Program, Department of Computer Science and Information Engineering
107
The main reason for the low usage rate of Long-Term Care 2.0 is that most people do not understand the system; improving awareness and usage of long-term care is thus the main purpose of this study. This study proposes Integrating LSI, K-means and K-NN into a Semantic Cloud Framework (ILKKSCF) to solve the above problems. The core of the system is to use Latent Semantic Indexing (LSI), K-means, and K-NN from machine learning, combine them with the Semantic Web, and analyze big data on cloud computing. Articles related to long-term care are collected and segmented into words with Jieba. Matrix Market format is used to turn the words into matrix vectors, and TF-IDF is used to calculate word weights. An LSI module is then built to find hidden associations between words, and the K-means algorithm clusters the words. This research constructs a Long-Term Care Application Platform (LCAP), collects users' questions about Long-Term Care 2.0 through LCAP, feeds them into the built LSI module, and classifies the questions with the K-NN algorithm. Finally, matching articles are found through cosine similarity to reply to the user. In addition, the user information collected by LCAP and the long-term care sites in Open Data are integrated into the long-term care sites and services recommended via the Semantic Web. Because of the huge amount of data accumulated over the years, Spark is used to integrate the machine learning and Semantic Web components into cloud computing to improve speed, and the K-value settings of LSI and K-means on Spark, as well as cloud performance under different data volumes, are compared. The system is evaluated through accuracy and satisfaction surveys. According to the results, LSI and K-means meet the needs of the system at K=300, and the overall satisfaction score is 4.15 out of 5, which verifies the feasibility of ILKKSCF.
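The final matching step named in this abstract, replying with the article most similar to the user's question under cosine similarity, can be sketched minimally. This is a hedged illustration with invented vectors standing in for LSI representations; the thesis's actual Spark pipeline, Jieba segmentation, and K-NN classification are not reproduced.

```python
import numpy as np

# Hedged sketch: return the index of the stored article whose vector
# has the highest cosine similarity to the user's question vector.
def best_match(query_vec, doc_vecs):
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return int(np.argmax(D @ q))     # row-wise cosine similarities

articles = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.7, 0.7, 0.0]])   # invented article vectors
question = np.array([0.9, 0.1, 0.0])    # invented question vector
idx = best_match(question, articles)
```

Normalizing rows once and taking a single matrix-vector product keeps the per-query cost linear in the number of candidate articles, which is why cosine matching scales well after the K-NN step has narrowed the candidate set.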
APA, Harvard, Vancouver, ISO, and other styles
38

Rodrigues, Alexandre José Monteiro. "Recomendação de conteúdos : aplicação de agrupamento distribuído a conteúdos de TV." Master's thesis, 2010. http://hdl.handle.net/10216/63414.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Rodrigues, Alexandre José Monteiro. "Recomendação de conteúdos : aplicação de agrupamento distribuído a conteúdos de TV." Dissertation, 2010. http://hdl.handle.net/10216/63414.

Full text
APA, Harvard, Vancouver, ISO, and other styles
