Дисертації: "Approximate string matching problem"

1

Cheng, Lok-lam, and 鄭樂霖. "Approximate string matching in DNA sequences." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2003. http://hub.hku.hk/bib/B29350591.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

2

Nguyen, Hong-Thinh. "Approximate string matching distance for image classification." Thesis, Saint-Etienne, 2014. http://www.theses.fr/2014STET4029/document.

Повний текст джерела

Анотація:

L'augmentation exponentielle du nombre d'images nécessite des moyens efficaces pour les classer en fonction de leur contenu visuel. Le sac de mot visuel (Bag-Of-visual-Words, BOW), en raison de sa simplicité et de sa robustesse, devient l'approche la plus populaire. Malheureusement, cette approche ne prend pas en compte de l'information spatiale, ce qui joue un rôle important dans les catégories de modélisation d'image. Récemment, Lazebnik ont introduit la représentation pyramidale spatiale (Spatial Pyramid Representation, SPR) qui a incorporé avec succès l'information spatiale dans le modèle BOW. Néanmoins, ce système de correspondance rigide empêche la SPR de gérer les variations et les transformations d'image. L'objectif principal de cette thèse est d'étudier un modèle de chaîne de correspondance plus souple qui prend l'avantage d'histogrammes de BOW locaux et se rapproche de la correspondance de la chaîne. Notre première contribution est basée sur une représentation en chaîne et une nouvelle distance d'édition (String Matching Distance, SMD) bien adapté pour les chaînes de l'histogramme qui peut calculer efficacement par programmation dynamique. Un noyau d'édition correspondant comprenant à la fois d'une pondération et d'un système pyramidal est également dérivée. La seconde contribution est une version étendue de SMD qui remplace les opérations d'insertion et de suppression par les opérations de fusion entre les symboles successifs, ce qui apporte de la souplesse labours et correspond aux images. Toutes les distances proposées sont évaluées sur plusieurs jeux de données tâche de classification et sont comparés avec plusieurs approches concurrentes
The exponential increasing of the number of images requires efficient ways to classify them based on their visual content. The most successful and popular approach is the Bag of visual Word (BoW) representation due to its simplicity and robustness. Unfortunately, this approach fails to capture the spatial image layout, which plays an important roles in modeling image categories. Recently, Lazebnik et al (2006) introduced the Spatial Pyramid Representation (SPR) which successfully incorporated spatial information into the BoW model. The idea of their approach is to split the image into a pyramidal grid and to represent each grid cell as a BoW. Assuming that images belonging to the same class have similar spatial distributions, it is possible to use a pairwise matching as similarity measurement. However, this rigid matching scheme prevents SPR to cope with image variations and transformations. The main objective of this dissertation is to study a more flexible string matching model. Keeping the idea of local BoW histograms, we introduce a new class of edit distance to compare strings of local histograms. Our first contribution is a string based image representation model and a new edit distance (called SMD for String Matching Distance) well suited for strings composed of symbols which are local BoWs. The new distance benefits from an efficient Dynamic Programming algorithm. A corresponding edit kernel including both a weighting and a pyramidal scheme is also derived. The performance is evaluated on classification tasks and compared to the standard method and several related methods. The new method outperforms other methods thanks to its ability to detect and ignore identical successive regions inside images. Our second contribution is to propose an extended version of SMD replacing insertion and deletion operations by merging operations between successive symbols. In this approach, the number of sub regions ie. the grid divisions may vary according to the visual content. We describe two algorithms to compute this merge-based distance. The first one is a greedy version which is efficient but can produce a non optimal edit script. The other one is an optimal version but it requires a 4th degree polynomial complexity. All the proposed distances are evaluated on several datasets and are shown to outperform comparable existing methods

Стилі APA, Harvard, Vancouver, ISO та ін.

3

Marco-Sola, Santiago. "Efficient approximate string matching techniques for sequence alignment." Doctoral thesis, Universitat Politècnica de Catalunya, 2017. http://hdl.handle.net/10803/460835.

Повний текст джерела

Анотація:

One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Due to the fact that at the moment it is technically impossible to decode the genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is aligned or mapped to the genome): for each read one identifies in the reference regions that share a large sequence similarity with it, therefore indicating what the read¿s point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between his/her individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods; this is due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly. This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges towards efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework able to assess the quality of HTS mapping results. In order to be able to do so one has to understand the source and nature of mapping conflicts, and explore the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms. They contribute novel insights that can be used in practical applications towards efficient and accurate read mapping. More in detail, first we present methods able to reduce the storage space taken by indexes for genome-scale references, while still providing fast query access in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping, but are nonetheless capable of giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental algorithmic techniques able to combine several approximate string matching algorithms; this leads to efficient and flexible search algorithms allowing the user to reach arbitrary search depths. All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both as to running time and to the quality of the results it produces.
Uno de los avances más importantes de los últimos años en el campo de la biotecnología ha sido el desarrollo de las llamadas técnicas de secuenciación de alto rendimiento (high-throughput sequencing, HTS). Debido a las limitaciones técnicas para secuenciar un genoma, las técnicas de alto rendimiento secuencian individualmente billones de pequeñas partes del genoma provenientes de regiones aleatorias. Posteriormente, estas pequeñas secuencias han de ser localizadas en el genoma de referencia del organismo en cuestión. Este proceso se denomina alineamiento - o mapeado - y consiste en identificar aquellas regiones del genoma de referencia que comparten una alta similaridad con las lecturas producidas por el secuenciador. De esta manera, en cuestión de horas, la secuenciación de alto rendimiento puede secuenciar un individuo y establecer las diferencias de este con el resto de la especie. En última instancia, estas tecnologías han potenciado nuevos protocolos y metodologías de investigación con un profundo impacto en el campo de la genómica, la medicina y la biología en general. La secuenciación alto rendimiento, sin embargo, supone un reto para los procesos tradicionales de análisis de datos. Debido a la elevada cantidad de datos a analizar, se necesitan nuevas y mejoradas técnicas algorítmicas que puedan escalar con el volumen de datos y producir resultados precisos. Esta tesis aborda dicho problema. Las contribuciones que en ella se realizan se enfocan desde una perspectiva metodológica y otra algorítmica que propone el desarrollo de nuevos algoritmos y técnicas que permitan alinear secuencias de manera eficiente, precisa y escalable. Desde el punto de vista metodológico, esta tesis analiza y propone un marco de referencia para evaluar la calidad de los resultados del alineamiento de secuencias. Para ello, se analiza el origen de los conflictos durante la alineación de secuencias y se exploran los límites alcanzables en calidad con las tecnologías de secuenciación de alto rendimiento. Desde el punto de vista algorítmico, en el contexto de la búsqueda aproximada de patrones, esta tesis propone nuevas técnicas algorítmicas y de diseño de índices con el objetivo de mejorar la calidad y el desempeño de las herramientas dedicadas a alinear secuencias. En concreto, esta tesis presenta técnicas de diseño de índices genómicos enfocados a obtener un acceso más eficiente y escalable. También se presentan nuevas técnicas algorítmicas de filtrado con el fin de reducir el tiempo de ejecución necesario para alinear secuencias. Y, por último, se proponen algoritmos incrementales y técnicas híbridas para combinar métodos de alineamiento y mejorar el rendimiento en búsquedas donde el error esperado es alto. Todo ello sin degradar la calidad de los resultados y con garantías formales de precisión. Para concluir, es preciso apuntar que todos los algoritmos y metodologías propuestos en esta tesis están implementados y forman parte del alineador GEM. Este versátil alineador ofrece resultados de alta calidad en entornos de producción siendo varias veces más rápido que otros alineadores. En la actualidad este software se ofrece gratuitamente, tiene una amplia comunidad de usuarios y ha sido citado en numerosas publicaciones científicas.

Стилі APA, Harvard, Vancouver, ISO та ін.

4

Mäkinen, Veli. "Parameterized approximate string matching and local-similarity-based point-pattern matching." Helsinki : University of Helsinki, 2003. http://ethesis.helsinki.fi/julkaisut/mat/tieto/vk/makinen/.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

5

Siragusa, Enrico [Verfasser]. "Approximate string matching for high-throughput sequencing / Enrico Siragusa." Berlin : Freie Universität Berlin, 2015. http://d-nb.info/1074404882/34.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

6

Keng, Leng Hui. "Approximate String Matching With Dynamic Programming and Suffix Trees." UNF Digital Commons, 2006. http://digitalcommons.unf.edu/etd/196.

Повний текст джерела

Анотація:

The importance and the contribution of string matching algorithms to the modern society cannot be overstated. From basic search algorithms such as spell checking and data querying, to advanced algorithms such as DNA sequencing, trend analysis and signal processing, string matching algorithms form the foundation of many aspects in computing that have been pivotal in technological advancement. In general, string matching algorithms can be divided into the categories of exact string matching and approximate string matching. We study each area and examine some of the well known algorithms. We probe into one of the most intriguing data structure in string algorithms, the suffix tree. The lowest common ancestor extension of the suffix tree is the key to many advanced string matching algorithms. With these tools, we are able to solve string problems that were, until recently, thought intractable by many. Another interesting and relatively new data structure in string algorithms is the suffix array, which has significant breakthroughs in its linear time construction in recent years. Primarily, this thesis focuses on approximate string matching using dynamic programming and hybrid dynamic programming with suffix tree. We study both approaches in detail and see how the merger of exact string matching and approximate string matching algorithms can yield synergistic results in our experiments.

Стилі APA, Harvard, Vancouver, ISO та ін.

7

Smith, David. "Parallel approximate string matching applied to occluded object recognition." PDXScholar, 1987. https://pdxscholar.library.pdx.edu/open_access_etds/3724.

Повний текст джерела

Анотація:

This thesis develops an algorithm for approximate string matching and applies it to the problem of partially occluded object recognition. The algorithm measures the similarity of differing strings by scanning for matching substrings between strings. The length and number of matching substrings determines the amount of similarity. A classification algorithm is developed using the approximate string matching algorithm for the identification and classification of objects. A previously developed method of shape description is used for object representation.

Стилі APA, Harvard, Vancouver, ISO та ін.

8

Dubois, Simon. "Offline Approximate String Matching forInformation Retrieval : An experiment on technical documentation." Thesis, Tekniska Högskolan, Högskolan i Jönköping, JTH. Forskningsmiljö Informationsteknik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-22566.

Повний текст джерела

Анотація:

Approximate string matching consists in identifying strings as similar even ifthere is a number of mismatch between them. This technique is one of thesolutions to reduce the exact matching strictness in data comparison. In manycases it is useful to identify stream variation (e.g. audio) or word declension (e.g.prefix, suffix, plural). Approximate string matching can be used to score terms in InformationRetrieval (IR) systems. The benefit is to return results even if query terms doesnot exactly match indexed terms. However, as approximate string matchingalgorithms only consider characters (nor context neither meaning), there is noguarantee that additional matches are relevant matches. This paper presents the effects of some approximate string matchingalgorithms on search results in IR systems. An experimental research design hasbeen conducting to evaluate such effects from two perspectives. First, resultrelevance is analysed with precision and recall. Second, performance is measuredthanks to the execution time required to compute matches. Six approximate string matching algorithms are studied. Levenshtein andDamerau-Levenshtein computes edit distance between two terms. Soundex andMetaphone index terms based on their pronunciation. Jaccard similarity calculatesthe overlap coefficient between two strings. Tests are performed through IR scenarios regarding to different context,information need and search query designed to query on a technicaldocumentation related to software development (man pages from Ubuntu). Apurposive sample is selected to assess document relevance to IR scenarios andcompute IR metrics (precision, recall, F-Measure). Experiments reveal that all tested approximate matching methods increaserecall on average, but, except Metaphone, they also decrease precision. Soundexand Jaccard Similarity are not advised because they fail on too many IR scenarios.Highest recall is obtained by edit distance algorithms that are also the most timeconsuming. Because Levenshtein-Damerau has no significant improvementcompared to Levenshtein but costs much more time, the last one is recommendedfor use with a specialised documentation. Finally some other related recommendations are given to practitioners toimplement IR systems on technical documentation.

Стилі APA, Harvard, Vancouver, ISO та ін.

9

Pockrandt, Christopher Maximilian [Verfasser]. "Approximate String Matching : Improving Data Structures and Algorithms / Christopher Maximilian Pockrandt." Berlin : Freie Universität Berlin, 2019. http://d-nb.info/1183675879/34.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

10

Richter, Christoph Jan [Verfasser]. "Über die Vermeidung redundanter Betrachtungen beim Approximate String Matching / Christoph Jan Richter." Dortmund : Universitätsbibliothek Technische Universität Dortmund, 2005. http://d-nb.info/1011533499/34.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

11

Jupin, Joseph. "Temporal Graph Record Linkage and k-Safe Approximate Match." Diss., Temple University Libraries, 2016. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/412419.

Повний текст джерела

Анотація:

Computer and Information Science
Ph.D.
Since the advent of electronic data processing, organizations have accrued vast amounts of data contained in multiple databases with no reliable global unique identifier. These databases were developed by different departments for different purposes at different times. Organizing and analyzing these data for human services requires linking records from all sources. RL (Record Linkage) is a process that connects records that are related to the identical or a sufficiently similar entity from multiple heterogeneous databases. RL is a data and compute intensive, mission critical process. The process must be efficient enough to process big data and effective enough to provide accurate matches. We have evaluated an RL system that is currently in use by a local health and human services department. We found that they were using the typical approach that was offered by Fellegi and Sunter with tuple-by-tuple processing, using the Soundex as the primary approximate string matching method. The Soundex has been found to be unreliable both as a phonetic and as an approximate string matching method. We found that their data, in many cases, has more than one value per field, suggesting that the data were queried from a 5NF data base. Consider that if a woman has been married 3 times, she may have up to 4 last names on record. This query process produced more than one tuple per database/entity apparently generating a Cartesian product of this data. In many cases, more than a dozen tuples were observed for a single database/entity. This approach is both ineffective and inefficient. An effective RL method should handle this multi-data without redundancy and use edit-distance for approximate string matching. However, due to high computational complexity, edit-distance will not scale well with big data problems. We developed two methodologies for resolving the aforementioned issues: PSH and ALIM. PSH – The Probabilistic Signature Hash is a composite method that increases the speed of Damerau-Levenshtein edit-distance. It combines signature filtering, probabilistic hashing, length filtering and prefix pruning to increase the speed of edit-distance. It is also lossless because it does not lose any true positive matches. ALIM – Aggregate Link and Iterative Match is a graph-based record linkage methodology that uses a multi-graph to store demographic data about people. ALIM performs string matching as records are inserted into the graph. ALIM eliminates data redundancy and stores the relationships between data. We tested PSH for string comparison and found it to be approximately 6,000 times faster than DL. We tested it against the trie-join methods and found that they are up to 6.26 times faster but lose between 10 and 20 percent of true positives. We tested ALIM against a method currently in use by a local health and human services department and found ALIM to produce significantly more matches (even with more restrictive match criteria) and that ALIM ran more than twice as fast. ALIM handles the multi-data problem and PSH allows the use of edit-distance comparison in this RL model. ALIM is more efficient and effective than a currently implemented RL system. This model can also be expanded to perform social network analysis and temporal data modeling. For human services, temporal modeling can reveal how policy changes and treatments affect clients over time and social network analysis can determine the effects of these on whole families by facilitating family linkage.
Temple University--Theses

Стилі APA, Harvard, Vancouver, ISO та ін.

12

Owolabi, Olumide. "Hardware and software approximate string matching and pattern recognition for intelligent knowledge based systems." Thesis, University of Strathclyde, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.280018.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

13

黎少斌 and Shiao-bun Lai. "Trading off time for space for the string matching problem." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1996. http://hub.hku.hk/bib/B31214216.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

14

Lai, Shiao-bun. "Trading off time for space for the string matching problem /." Hong Kong : University of Hong Kong, 1996. http://sunzi.lib.hku.hk/hkuto/record.jsp?B18061795.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

15

Fogla, Prahlad. "Improving the Efficiency and Robustness of Intrusion Detection Systems." Diss., Georgia Institute of Technology, 2007. http://hdl.handle.net/1853/19772.

Повний текст джерела

Анотація:

With the increase in the complexity of computer systems, existing security measures are not enough to prevent attacks. Intrusion detection systems have become an integral part of computer security to detect attempted intrusions. Intrusion detection systems need to be fast in order to detect intrusions in real time. Furthermore, intrusion detection systems need to be robust against the attacks which are disguised to evade them. We improve the runtime complexity and space requirements of a host-based anomaly detection system that uses q-gram matching. q-gram matching is often used for approximate substring matching problems in a wide range of application areas, including intrusion detection. During the text pre-processing phase, we store all the q-grams present in the text in a tree. We use a tree redundancy pruning algorithm to reduce the size of the tree without losing any information. We also use suffix links for fast linear-time q-gram search during query matching. We compare our work with the Rabin-Karp based hash-table technique, commonly used for multiple q-gram matching. To analyze the robustness of network anomaly detection systems, we develop a new class of polymorphic attacks called polymorphic blending attacks, that can effectively evade payload-based network anomaly IDSs by carefully matching the statistics of the mutated attack instances to the normal profile. Using PAYL anomaly detection system for our case study, we show that these attacks are practically feasible. We develop a formal framework which is used to analyze polymorphic blending attacks for several network anomaly detection systems. We show that generating an optimal polymorphic blending attack is NP-hard for these anomaly detection systems. However, we can generate polymorphic blending attacks using the proposed approximation algorithms. The framework can also be used to improve the robustness of an intrusion detector. We suggest some possible countermeasures one can take to improve the robustness of an intrusion detection system against polymorphic blending attacks.

Стилі APA, Harvard, Vancouver, ISO та ін.

16

Bergman, John. "Efficient fuzzy type-ahead search on big data using a ranked trie data structure." Thesis, Umeå universitet, Institutionen för fysik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-145029.

Повний текст джерела

Анотація:

The efficiency of modern search engines depends on how well they present typo-corrected results to a user while typing. So-called fuzzy type-ahead search combines fuzzy string matching and search-as-you-type functionality, and creates a powerful tool for exploring indexed data. Current fuzzy type-ahead search algorithms work well on small data sets, but for big data of social networking services such as Facebook, e-commerce sites such as Amazon, or media streaming services such as YouTube, responsive fuzzy type-ahead search remains a great challenge. This thesis describes a method that enables responsive type-ahead search combined with fuzzy string matching on big data by keeping the search time optimal for human interaction at the expense of lower accuracy for less popular records when a query contains typos. This makes the method effective for e-commerce and media services where the popularity of search terms is a result of human behaviour and thus often follow a power-law distribution.
Effektiviteten hos moderna sökmotorer beror på hur väl de presenterar rättstavade resultat för en användare medan en sökning skrivs. Så kallad fuzzy type-ahead sök kombinerar approximativ strängmatchning och sök-medan-du-skriver funktionalitet, vilket skapar ett kraftfullt verktyg för att utforska data. Dagens algoritmer för fuzzy type-ahead sök fungerar väl för små mängder data, men för data i storleksordningen “big data” från t.ex sociala nätverkstjänster så som Facebook, e-handelssidor så som Amazon, eller media tjänster så som YouTube, är en responsiv fuzzy type-ahead sök ännu en stor utmaning. Denna avhandling beskriver en metod som möjliggör responsiv type-ahead sök kombinerat med approximativ strängmatchning för big data genom att hålla söktiden optimal för mänsklig interaktion på bekostnad av lägre precision för mindre populär information när en sök-förfrågan innehåller felstavningar. Detta gör metoden effektiv för e-handel och mediatjänster där populariteten av sök-termer är ett resultat av mänskligt beteende vilket ofta följer en potens-lag distribution.

Стилі APA, Harvard, Vancouver, ISO та ін.

17

Toth, Róbert. "Přibližné vyhledávání řetězců v předzpracovaných dokumentech." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236122.

Повний текст джерела

Анотація:

This thesis deals with the problem of approximate string matching, also called string matching allowing errors. The thesis targets the area of offline algorithms, which allows very fast pattern matching thanks to index created during initial text preprocessing phase. Initially, we will define the problem itself and demonstrate variety of its applications, followed by short survey of different approaches to cope with this problem. Several existing algorithms based on suffix trees will be explained in detail and new hybrid algorithm will be proposed. Algorithms wil be implemented in C programming language and thoroughly compared in series of experiments with focus on newly presented algorithm.

Стилі APA, Harvard, Vancouver, ISO та ін.

18

Källberg, David. "Nonparametric Statistical Inference for Entropy-type Functionals." Doctoral thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-79976.

Повний текст джерела

Анотація:

In this thesis, we study statistical inference for entropy, divergence, and related functionals of one or two probability distributions. Asymptotic properties of particular nonparametric estimators of such functionals are investigated. We consider estimation from both independent and dependent observations. The thesis consists of an introductory survey of the subject and some related theory and four papers (A-D). In Paper A, we consider a general class of entropy-type functionals which includes, for example, integer order Rényi entropy and certain Bregman divergences. We propose U-statistic estimators of these functionals based on the coincident or epsilon-close vector observations in the corresponding independent and identically distributed samples. We prove some asymptotic properties of the estimators such as consistency and asymptotic normality. Applications of the obtained results related to entropy maximizing distributions, stochastic databases, and image matching are discussed. In Paper B, we provide some important generalizations of the results for continuous distributions in Paper A. The consistency of the estimators is obtained under weaker density assumptions. Moreover, we introduce a class of functionals of quadratic order, including both entropy and divergence, and prove normal limit results for the corresponding estimators which are valid even for densities of low smoothness. The asymptotic properties of a divergence-based two-sample test are also derived. In Paper C, we consider estimation of the quadratic Rényi entropy and some related functionals for the marginal distribution of a stationary m-dependent sequence. We investigate asymptotic properties of the U-statistic estimators for these functionals introduced in Papers A and B when they are based on a sample from such a sequence. We prove consistency, asymptotic normality, and Poisson convergence under mild assumptions for the stationary m-dependent sequence. Applications of the results to time-series databases and entropy-based testing for dependent samples are discussed. In Paper D, we further develop the approach for estimation of quadratic functionals with m-dependent observations introduced in Paper C. We consider quadratic functionals for one or two distributions. The consistency and rate of convergence of the corresponding U-statistic estimators are obtained under weak conditions on the stationary m-dependent sequences. Additionally, we propose estimators based on incomplete U-statistics and show their consistency properties under more general assumptions.

Стилі APA, Harvard, Vancouver, ISO та ін.

19

Neto, Domingos Soares. "Filtros para a busca e extração de padrões aproximados em cadeias biológicas." Universidade de São Paulo, 2008. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-19102009-002745/.

Повний текст джерела

Анотація:

Esta dissertação de mestrado aborda formulações computacionais e algoritmos para a busca e extração de padrões em cadeias biológicas. Em particular, o presente texto concentra-se nos dois problemas a seguir, considerando-os sob as distâncias de Hamming e Levenshtein: a) como determinar os locais nos quais um dado padrão ocorre de modo aproximado em uma cadeia fornecida; b) como extrair padrões que ocorram de modo aproximado em um número significativo de cadeias de um conjunto fornecido. O primeiro problema, para o qual já existem diversos algoritmos polinomiais, tem recebido muita atenção desde a década de 60, e ganhou novos ares com o advento da biologia computacional, nos idos dos anos 80, e com a popularização da Internet e seus mecanismos de busca: ambos os fenômenos trouxeram novos obstáculos a serem superados, em razão do grande volume de dados e das bastante justas restrições de tempo inerentes a essas aplicações. O segundo problema, de surgimento um pouco mais recente, é intrinsicamente desafiador, em razão de sua complexidade computacional, do tamanho das entradas tratadas nas aplicações mais comuns e de sua dificuldade de aproximação. Também é de chamar a atenção o seu grande potencial de aplicação. Neste trabalho são apresentadas formulações adequadas dos problemas abordados, assim como algoritmos e estruturas de dados essenciais ao seu estudo. Em especial, estudamos a extremamente versátil árvore dos sufixos, assim como uma de suas generalizações e sua estrutura irmã: o vetor dos sufixos. Grande parte do texto é dedicada aos filtros baseados em q-gramas para a busca aproximada de padrões e algumas de suas mais recentes variações. Estão cobertos os algoritmos bit-paralelos de Myers e Baeza-Yates-Gonnet para a busca de padrões; os algoritmos de Sagot para a extração de padrões; os algoritmos de filtragem de Ukkonen, Jokinen-Ukkonen, Burkhardt-Kärkkäinen, entre outros.
This thesis deals with computational formulations and algorithms for the extraction and search of patterns from biological strings. In particular, the present text focuses on the following problems, both considered under Hamming and Levenshtein distances: 1. How to find the positions where a given pattern approximatelly occurs in a given string; 2. How to extract patterns which approximatelly occurs in a certain number of strings from a given set. The first problem, for which there are many polinomial time algorithms, has been receiving a lot of attention since the 60s and entered a new era of discoveries with the advent of computational biology, in the 80s, and the widespread of the Internet and its search engines: both events brought new challenges to be faced by virtue of the large volume of data usually held by such applications and its time constraints. The second problem, much younger, is very challenging due to its computational complexity, approximation hardness and the size of the input data usually held by the most common applications. This problem is also very interesting due to its potential of application. In this work we show computational formulations, algorithms and data structures for those problems. We cover the bit-parallel algorithms of Myers, Baeza-Yates-Gonnet and the Sagots algorithms for patterns extraction. We also cover here the oustanding versatile suffix tree, its generalised version, and a similar data structure: the suffix array. A significant part of the present work focuses on q-gram based filters designed to solve the approximate pattern search problem. More precisely, we cover the filter algorithms of Ukkonen, Jokinen-Ukkonen and Burkhardt-Kärkkäinen, among others.

Стилі APA, Harvard, Vancouver, ISO та ін.

20

Тодоріко, Ольга Олексіївна. "Моделі та методи очищення та інтеграції текстових даних в інформаційних системах". Thesis, Запорізький національний університет, 2016. http://repository.kpi.kharkov.ua/handle/KhPI-Press/21856.

Повний текст джерела

Анотація:

Дисертація на здобуття наукового ступеня кандидата технічних наук за спеціальністю 05.13.06 – інформаційні технології. – Національний технічний університет "Харківський політехнічний інститут", Харків, 2016. У дисертаційній роботі вирішена актуальна науково-практична задача підвищення ефективності та якості технології очищення та інтеграції текстових даних в довідкових і пошукових інформаційних системах за рахунок використання моделей словозмінної парадигми та методу побудови лексемного індексу при організації пошуку за схожістю. Розроблено моделі словозмінної парадигми, що включають представлення слів та обчислення приблизної міри схожості між ними. Розроблено метод побудови лексемного індексу, що базується на запропонованих моделях словозмінної парадигми та дозволяє відобразити слово і всі його словоформи в один запис індексу. Удосконалено метод пошуку за схожістю за рахунок покращення етапу попередньої фільтрації завдяки використанню розробленої моделі словозмінної парадигми та лексемного індексу. Виконана експериментальна оцінка ефективності вказує на високу точність та 99 0,5 % повноту. Удосконалено інформаційну технологію очищення та інтеграції даних за рахунок розроблених моделей та методів. Розроблено програмну реалізацію, яка на базі запропонованих моделей та методів виконує пошук за схожістю, очищення та інтеграцію наборів даних. Одержані в роботі теоретичні та практичні результати впроваджено у виробничий процес документообігу приймальної комісії та навчальний процес математичного факультету Державного вищого навчального закладу "Запорізький національний університет".
The thesis for the candidate degree in technical sciences, speciality 05.13.06 – Information Technologies. – National Technical University "Kharkiv Polytechnic Institute", Kharkiv, 2016. In the thesis the actual scientific and practical problem of increasing the efficiency and quality of cleaning and integration of data in information reference system and information retrieval system is solved. The improvement of information technology of cleaning and integration of data is achieved by reduction of quantity of mistakes in text information by means of use of model of an inflectional paradigm, methods of creation of a lexeme index, advanced methods of tolerant retrieval. The developed model of an inflectional paradigm includes a representation of words as an ordered collection of signatures and an approximate measure of similarity between two representations. The model differs in method of dealing with forms of words and character positions. It provides the basis for the implementation of improved methods of tolerant retrieval, cleaning and integration of datasets. The method of creation of the lexeme index which is based on the offered model of an inflectional paradigm is developed, and it allows mapping a word and all its forms to a record of the index. The method of tolerant retrieval is improved at preliminary filtration stage thanks to the developed model of an inflectional paradigm and the lexeme index. The experimental efficiency evaluation indicates high precision and 99  0,5 % recall. The information technology of cleaning and integration of data is improved using the developed models and methods. The software which on the basis of the developed models and methods carries out tolerant retrieval, cleaning and integration of data sets was developed. Theoretical and practical results of the thesis are introduced in production of document flow of an entrance committee and educational process of mathematical faculty of the State institution of higher education "Zaporizhzhya National University".

Стилі APA, Harvard, Vancouver, ISO та ін.

21

Тодоріко, Ольга Олексіївна. "Моделі та методи очищення та інтеграції текстових даних в інформаційних системах". Thesis, НТУ "ХПІ", 2016. http://repository.kpi.kharkov.ua/handle/KhPI-Press/21853.

Повний текст джерела

Анотація:

Дисертація на здобуття наукового ступеня кандидата технічних наук за спеціальністю 05.13.06 – інформаційні технології. – Національний технічний університет «Харківський політехнічний інститут», Харків, 2016. У дисертаційній роботі вирішена актуальна науково-практична задача підвищення ефективності та якості технології очищення та інтеграції текстових даних в довідкових і пошукових інформаційних системах за рахунок використання моделей словозмінної парадигми та методу побудови лексемного індексу при організації пошуку за схожістю. Розроблено моделі словозмінної парадигми, що включають представлення слів та обчислення приблизної міри схожості між ними. Розроблено метод побудови лексемного індексу, що базується на запропонованих моделях словозмінної парадигми та дозволяє відобразити слово і всі його словоформи в один запис індексу. Удосконалено метод пошуку за схожістю за рахунок покращення етапу попередньої фільтрації завдяки використанню розробленої моделі словозмінної парадигми та лексемного індексу. Виконана експериментальна оцінка ефективності вказує на високу точність та 99 0,5 % повноту. Удосконалено інформаційну технологію очищення та інтеграції даних за рахунок розроблених моделей та методів. Розроблено програмну реалізацію, яка на базі запропонованих моделей та методів виконує пошук за схожістю, очищення та інтеграцію наборів даних. Одержані в роботі теоретичні та практичні результати впроваджено у виробничий процес документообігу приймальної комісії та навчальний процес математичного факультету Державного вищого навчального закладу «Запорізький національний університет».
The thesis for the candidate degree in technical sciences, speciality 05.13.06 – Information Technologies. – National Technical University «Kharkiv Polytechnic Institute», Kharkiv, 2016. In the thesis the actual scientific and practical problem of increasing the efficiency and quality of cleaning and integration of data in information reference system and information retrieval system is solved. The improvement of information technology of cleaning and integration of data is achieved by reduction of quantity of mistakes in text information by means of use of model of an inflectional paradigm, methods of creation of a lexeme index, advanced methods of tolerant retrieval. The developed model of an inflectional paradigm includes a representation of words as an ordered collection of signatures and an approximate measure of similarity between two representations. The model differs in method of dealing with forms of words and character positions. It provides the basis for the implementation of improved methods of tolerant retrieval, cleaning and integration of datasets. The method of creation of the lexeme index which is based on the offered model of an inflectional paradigm is developed, and it allows mapping a word and all its forms to a record of the index. The method of tolerant retrieval is improved at preliminary filtration stage thanks to the developed model of an inflectional paradigm and the lexeme index. The experimental efficiency evaluation indicates high precision and 99  0,5 % recall. The information technology of cleaning and integration of data is improved using the developed models and methods. The software which on the basis of the developed models and methods carries out tolerant retrieval, cleaning and integration of data sets was developed. Theoretical and practical results of the thesis are introduced in production of document flow of an entrance committee and educational process of mathematical faculty of the State institution of higher education «Zaporizhzhya National University».

Стилі APA, Harvard, Vancouver, ISO та ін.

22

Huang, Shih-Yuan, and 黃詩元. "Approximate String Matching under Non-Overlapping Inversions." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/91360854996893997589.

Повний текст джерела

Анотація:

碩士
國立清華大學
資訊工程學系
102
In this thesis, we introduce and study the approximate string matching problem under non-overlapping inversion distance. Given a text t, a pattern p and a non-negative integer k, the goal of the problem is to find all locations in the text t that match the pattern p with at most k non-overlapping inversions. First, we use the dynamic programming approach to design an algorithm that solves this problem in O(nm^2 ) time and O(m^2) space, where n is the length of the text and m is the length of the pattern. Next, we present another algorithm based on an efficient filtering strategy that has the same worst-case time and space complexities as the first algorithm.

Стилі APA, Harvard, Vancouver, ISO та ін.

23

Lu, Chia Wei, and 呂嘉維. "Efficient Exact and Approximate String Matching Algorithms." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/49922113552090356808.

Повний текст джерела

Анотація:

博士
國立清華大學
資訊工程學系
102
In this thesis, we first propose two algorithms for exact string matching problem, which aims to find all the positions i's in a given text where a given pattern occurs. Our algorithms find an optimal selective comparing order of the characters of the pattern so that we could have a better performance in the searching phase. To find the optimal comparing order, we adopt the branch and bound approach. Moreover, our proposed algorithm can be combined with other existing exact string matching algorithms to improve the searching efficiency. The experimental results show that our algorithms indeed have the smallest number of character comparisons and are also efficient in time as compared with other existing exact string matching algorithms. Second, we propose a new filtration algorithm, as well as a hybrid filtration strategy, to efficiently solve the approximate string matching problem (also called the k-difference problem), which aims to find all the positions i's in a given text such that there exists a substring of the text ending at position i whose edit distance from a given pattern is less than or equal to a given error bound k. Our experimental results on simulated datasets of DNA sequences show that when compared with other filtration algorithms, our filtration algorithm has better performance on the efficiency to filter out those positions of the text at which the pattern does not occur approximately. Moreover, our hybrid filtration strategy further improves the effectiveness of our filtration algorithm. Third, we propose a progressive approach to solve the DNA resequencing problem which is defined as follows: We are given an unknown DNA sequence X and a known reference sequence R. Our task is to see whether X and R are similar or not. The present popular approach is to break up X into subsequences by the next generation sequencing (NGS) technologies, called reads. We then map the reads of X onto R with a suitable error bound. However, if the similarity between X and R is not very high (<95%), there would be many reads unmapped, and we then cannot obtain the mutations inside the unmapped regions. One can use a large error bound to increase the number of reads mapped. But it is not a good solution because increasing error bound will also increase the probability of false positive mapping. Our approach uses a small error bound and to increase the number of reads mapped, our approach modifies R each time after the reads are mapped. Thus our approach is a progressive approach. Compared with other available tools, our approach allows us to be able to map more reads to the reference sequence. In our simulated experiments, we also show the high correctness of our mapping algorithm.

Стилі APA, Harvard, Vancouver, ISO та ін.

24

Pan, Z. H., and 潘子豪. "An Approximate String Matching Based on an Exact String Matching with Constant Pattern Length." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/45738203365580501083.

Повний текст джерела

Анотація:

碩士
國立暨南國際大學
資訊工程學系
96
The string matching problem is a very important problem which is needed to solve in computer science, computational biology and other domains. The exact matching problem is defined as follows: Given a string T whose length is n and a string P whose length is m, n≧m, find each location where string P occurs in string T. The approximate string matching problem is defined as follows: Given a pattern string P of length and a text string T of length n, n≧m, and a maximal number k of errors allowed, find each location i of T such that there exists a suffix T of T(0,i) and ED(A,B)≦k. In our thesis, we will use the exact string matching to solve the approximate string matching problem.

Стилі APA, Harvard, Vancouver, ISO та ін.

25

Pan, Zu-Hao. "An Approximate String Matching Based on an Exact String Matching with Constant Pattern Length." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-2406200816061900.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

26

Wang, Hui Min, and 王惠民. "A Pre-processing Procedure for Approximate String Matching." Thesis, 1995. http://ndltd.ncl.edu.tw/handle/93583400034298936718.

Повний текст джерела

Анотація:

碩士
中原大學
應用數學研究所
83
Given two strings P=p1p2...pm and T=t1t2...tn over an alphabet Omega whose size is s, the edit distance d(P,T) between P and T is the minimum number of editing operations needed to convert P to T. An editing operation is either an insertion, a deletion or a change of a character. The k-differences problem is to find all substrings T'' of T such that d(P,T'')<= k, where k is a positive integer. Many researchers work on this problem. The main purpose of this paper is not proposing a new algorithm but instead of a pre-processing procedure. This procedure can be easily applied to various algorithms. The basic idea is to find the impossible occurrence by counting the number of each symbol. When the approximate strings happen rarely, this procedure performs a good result.

Стилі APA, Harvard, Vancouver, ISO та ін.

27

Burkhardt, Stefan [Verfasser]. "Filter algorithms for approximate string matching / von Stefan Burkhardt." 2004. http://d-nb.info/972323015/34.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

28

LIU, YAN-SHAO, and 劉彥劭. "Query by Humming System Based-on Approximate String Matching Technique." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/96901427832257868600.

Повний текст джерела

Анотація:

碩士
國立臺灣大學
資訊工程學研究所
90
With the growth of digital music released on the Internet, the requirement of the music information retrieval system becomes more and more important. In this thesis, we proposed and realized a query-by-humming system based on string matching technique. Users can find songs they want by humming about ten seconds notes of the song. In this system, we adopt autocorrelation pitch tracking, amplitude based note segmentation, and key adjusting by key finding algorithm, to get the pitch interval and duration ratio representation from the user’s acoustic input. And different melody representations and similarity comparison methods are performed and compared. Besides, we try to use Genetic Algorithm to optimize the system parameter. Through the training process, we can find the most suitable parameter for the humming habits of the user, and it can help us to improve the query accuracy of our system.

Стилі APA, Harvard, Vancouver, ISO та ін.

29

Lu, Chia-Wei, and 呂嘉維. "String Matching Algorithms Based upon the Uniqueness Property and an Approximate String Matching Algorithm Based upon the Candidate Elimination Method." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/17632385982720975496.

Повний текст джерела

Анотація:

碩士
國立暨南國際大學
資訊工程學系
96
In this thesis, we consider two problems, exact string matching and approximate string matching problem. The exact string matching problem is to determine all of the locations of a pattern string P appearing in a text string T. We first point out a uniqueness property of a given pattern. We then propose four algorithms to solve the exact string matching problem based upon the uniqueness property. Experimental results showed that three of our algorithms are faster than the KMP and Boyer-Moore algorithms. The approximate string matching problem is defined as follows: Given a text string T, a pattern string P and an error bound k, find all substrings of T whose edit distances with P are smaller than or equal to k. We give a method to eliminate candidate locations in text T as there can be no substring S starting from those locations such that the edit distance between S and pattern P is smaller than or equal to k. Our method is simple to implement. Experimental results show that our method is effective, especially in the case when we perform natural language searching.

Стилі APA, Harvard, Vancouver, ISO та ін.

30

Lu, Chia-Wei. "String Matching Algorithms Based upon the Uniqueness Property and an Approximate String Matching Algorithm Based upon the Candidate Elimination Method." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-2406200815183100.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

31

Shieh, Y. K., and 謝一功. "An Improved Approximate String Matching Algorithm Based upon the Boyer-Moore Algorithm." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/66121462520038468689.

Повний текст джерела

Анотація:

碩士
國立暨南國際大學
資訊工程學系
96
In this thesis, we discuss the exact string matching problem and approximate string matching problem. We avoid using a brute-force algorithm to solve the string matching problem. We improve the Bad Character Rule of Boyer and Moore Algorithm and compare it with the Horspool Algorithm. And we find that the improved method include the information of the Horspool Algorithm. Therefore, we combine the improved method with the Horspool Algorithm. The combinative method has a better efficiency in searching phase.

Стилі APA, Harvard, Vancouver, ISO та ін.

32

Shieh, Yi-Kung. "An Improved Approximate String Matching Algorithm Based upon the Boyer-Moore Algorithm." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-2406200816121600.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

33

Huang, Chun-cheng, and 黃俊程. "High-Performance Parallel Location-Aware Algorithms for Approximate String Matching on GPUs." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/5b596z.

Повний текст джерела

Анотація:

碩士
國立臺灣師範大學
電機工程學系
105
Approximate string matching has been widely used in many applications, including deoxyribonucleic acid sequence searching, spell checking, text mining, and spam filters. The method is designed to find all locations of strings that approximately match a pattern in accordance with the number of insertion, deletion, and substitution operations. Among the proposed algorithms, the bit-parallel algorithms are considered to be the best and highly efficient algorithms. However, the traditional bit-parallel algorithms lacks the ability of identifying the start and end positions of a matched pattern. Furthermore, acceleration of the bit-parallel algorithms has become a crucial issue for processing big data nowadays. In this paper, we propose two kinds of parallel location-aware algorithms called data-segmented parallelism and high-degree parallelism as means to accelerate approximate string matching using graphic processing units. Experimental results show that the high-degree parallelism on GPUs achieves significant improvement in system and kernel throughputs compared to CPU counterparts. Compared to state-of-the-art approaches, the proposed high-degree parallelism achieves 11 to 105 times improvement. Finally, we develop a web service which allows users to perform approximate string matching on line and deliver all matched substrings with the start and end positions.

Стилі APA, Harvard, Vancouver, ISO та ін.

34

Chen, Hui-Min, and 陳慧敏. "An Exact String Matching Problem Using Data Encoding Scheme." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/99302180071058684639.

Повний текст джерела

Анотація:

碩士
國立暨南國際大學
資訊工程學系
96
The traditional exact string matching problem is to find all locations of a pattern string with length m in a text with length n. Here we propose a new encoding method to shorten the both lengths of pattern and text by substituting the substring between a special character for its length in O(m+n). Then we use an exact matching algorithm to solve the exact string matching problem on the encoding pattern and text. As can be seen、by using the encoding method、the pattern and text can be shortened about 2/|Σ| times the lengths of the original ones. In practice、it performs better than 2/|Σ|. For instance、for an English sentence pattern whose length is 50 and a text whose length is 200000、in average、the pattern is shortened to 6% of its original length and the text is shortened to 12.4% of its original length. Thus、the exact matching can be done in a much shorter time.

Стилі APA, Harvard, Vancouver, ISO та ін.

35

Chen, Hui-Min. "An Exact String Matching Problem Using Data Encoding Scheme." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-2406200814110600.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

36

Ou, Chia-Shin, and 歐家欣. "On the Bit-Parallel Approaches to String Matching Problem." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/71663047138441425921.

Повний текст джерела

Анотація:

博士
國立清華大學
資訊工程學系
102
In this thesis, we consider two problems: (1) Developing a systematical approach to solve problems involving special properties of bit-vectors (2) Developing a bit-parallel approach to solve the nearest neighbor string matching problem. For problem 1, suppose that we are given a vector consisting of only 1's and 0's and we are interested in finding some special properties of this vector. These properties all involve "for-all" or "there-exists" notations and we are interested in bit-parallel processes to find these properties. In Myers' work, a sequence of logical operations can be expressed as a logical formula: (∃k ≤ i, A(k) and ∀x∈[k, i – 1]) = (((A & B) + B) ^ B) | A) which is by no means easy to be obtained. The contribution of our research is to present a systematical method to find such logical formulas to solve problems involving bit vectors with "for-all" and "there-exists" notations. Using our way of thinking, the equation in Myers' work can be deduced step by step. Five logical prototype problems, "single-for-all", "single-there- exists", "multiple-for-all", "multiple- there-exists" and "multiple-there-exists-and-for-all", are defined in this thesis and are proved that all can be computed using bit-parallel operations in O(n/w) time, where w is the word size of the machine. We also consider four variants of the five problems and show that their logical formulas can be obtained using those of the five prototype problems systematically. The nearest neighbor string matching problem is defined as follows: Given a text string T = t1t2…tn and a pattern string P = p1p2…pm, the nearest neighbor string matching problem is to find all substrings of T whose edit distances with P are the smallest, among all substrings of T. The nearest neighbor string matching problem has useful applications in bioinformatics. It can be straight-forwardly solved by the Seller's Algorithm and the Myers Algorithm which are used to solve the approximate string matching problem. Hyyro and Navarro proposed a filtering algorithm to speed up the Myers Algorithm. However, Hyyro and Navarro's filtering approach needs to perform a pre-processing based on the error bound k which is given by the definition of approximate string matching problem. Hence, it is not suitable to be used to solve the nearest neighbor string matching problem which has no k. In this thesis, we present a modification of the Hyyro and Navarro Algorithm, and also present a bit-parallel algorithm combining the Myers Algorithm and the modified Hyyro and Navarro Algorithm. In experiments, we show that our bit-parallel algorithm is efficient.

Стилі APA, Harvard, Vancouver, ISO та ін.

37

Lin, Sheng-Yuan, and 林聖淵. "A High Performance Parallel Algorithm for Approximate String Matching on Multi-core Processor." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/96412144907123635479.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

38

Hou, Kuan-Wei, and 侯冠維. "The Discrete Convolution Method for Solving the Exact String Matching Problem." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/13881027671610839668.

Повний текст джерела

Анотація:

碩士
國立清華大學
電機工程學系
100
In this thesis, we introduce discrete convolution method on solving the exact string matching problem. Based on the assumption that all the text and pattern strings are generated randomly, we derived an equation which can approximate the probability of appearing of a pattern string in a text string. From this equation, we see that the probability that a pattern string appears in a text string reduces to 0 quickly as the length of the pattern string increases. Because of this observation, we introduce an algorithm based on the discrete convolution method with early termination. The algorithm terminates as soon as it discovers that a prefix of the pattern string does not appear in the text string. We show that the discrete convolution method with early termination is quite efficient to solve exact string matching problem for randomly generated text and pattern strings. In this thesis, we also show that the shift-add algorithm is equivalent to and can be implemented by the discrete convolution.

Стилі APA, Harvard, Vancouver, ISO та ін.

39

Yang, Chao-Yuan, and 楊朝淵. "A New Algorithm for Solving the String Matching with k-Mismatches Problem." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/53970156684807658094.

Повний текст джерела

Анотація:

碩士
國立暨南國際大學
資訊工程學系
93
String matching with k-mismatches problem is defined as follows. Given a pattern P with length m, a text T with length n and an integer k, string matching with k-mismatches problem wants to find all occurrences of P on T with k maximal number of mismatches allowable. In this thesis, we propose a new algorithm to solve the problem. The algorithm uses the Kangaroo method mentioned in the paper by Galil and Giancarlo and that by Landau and Vishkin. Our algorithm and the Kangaroo Method are distinct from the number of steps to find mismatches. Each step in our algorithm may find two new mismatches or at least one new mismatch in some case. We can compare our algorithm to the Kangaroo Method. Each step in the Kangaroo Method only finds a mismatch. Another difference between our algorithm and the Kangaroo method is the number of positions on the text which we need to verify if those positions are occurrences of the pattern with at most k mismatches. Our algorithm would consider that some positions on the text are impossible to be occurrences of the pattern and then we don’t need to waste any time to check whether those locations are matches with at most k mismatches.

Стилі APA, Harvard, Vancouver, ISO та ін.

40

Liao, Kuei-Hui, and 廖桂慧. "Solving the Exact String Matching Problem by Using the 2-Substring Algorithm." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/46723272181241712677.

Повний текст джерела

Анотація:

碩士
國立暨南國際大學
生物醫學科技研究所
95
String matching is a very important component of many problems, such as data compression, search engine, speech recognition, virus detection, computational biology, and so on. There are many efficient method proposed to solve the string matching problem. For example, KMP algorithm、Boyer-and-Moore algorithm. In this thesis, we proposed a method to solve the exact string matching problem. We proposed a rule, called the 2-substring rule, which avoids the brute force method and can be used to solve the problem. We know the time complexity of KMP algorithm is better than Boyer-and-Moore algorithm. But the practice of Boyer-and-Moore algorithm is faster than KMP algorithm. We implement our method in practice. Our method not only is faster than KMP algorithm, Boyer and Moore algorithm, Horspool algorithm and Quick Search algorithm but also is earlier to implement.

Стилі APA, Harvard, Vancouver, ISO та ін.

41

Zhong-He, Chen, and 陳中和. "The Application of Convolution to Suffix to Prefix Rule for the Exact String Matching Problem." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/55234892315310089339.

Повний текст джерела

Анотація:

碩士
國立暨南國際大學
資訊工程學系
95
In this thesis, we consider the exact string matching problem. We first point out a rule, called the suffix to prefix rule, which can be used to avoid the brute-force sliding window approach. The Backward Nondeterministic DAWG Matching algorithm, Backward Oracle algorithm and Reverse Factor algorithm all use this rule. To implement this rule, we have to find the longest suffix of text T which is equal to a prefix of pattern P. In this thesis, we point out that convolution can be used to do this. As can be seen, the convolution technique is easy to understand and easy to program.

Стилі APA, Harvard, Vancouver, ISO та ін.

42

Chen, Zhong-He. "The Application of Convolution to Suffix to Prefix Rule for the Exact String Matching Problem." 2006. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-2006200719565300.

Повний текст джерела

Стилі APA, Harvard, Vancouver, ISO та ін.

43

Dobiášovský, Jan. "Přibližná shoda znakových řetězců a její aplikace na ztotožňování metadat vědeckých publikací." Master's thesis, 2020. http://www.nusl.cz/ntk/nusl-415121.

Повний текст джерела

Анотація:

The thesis explores the application of approximate string matching in scientific publication record linkage process. An introduction to record matching along with five commonly used metrics for string distance (Levenshtein, Jaro, Jaro-Winkler, Cosine distances and Jaccard coefficient) are provided. These metrics are applied on publication metadata from V3S current research information system of the Czech Technical University in Prague. Based on the findings, optimal thresholds in the F1, F2 and F3-measures are determined for each metric.

Стилі APA, Harvard, Vancouver, ISO та ін.

Дисертації з теми "Approximate string matching problem"

Оформте джерело за APA, MLA, Chicago, Harvard та іншими стилями