Dissertations / Theses: 'Approximate String Matching (ASM)'

1

Cheng, Lok-lam, and 鄭樂霖. "Approximate string matching in DNA sequences." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2003. http://hub.hku.hk/bib/B29350591.

Full text

APA, Harvard, Vancouver, ISO, and other styles

2

Nguyen, Hong-Thinh. "Approximate string matching distance for image classification." Thesis, Saint-Etienne, 2014. http://www.theses.fr/2014STET4029/document.

Full text

Abstract:

L'augmentation exponentielle du nombre d'images nécessite des moyens efficaces pour les classer en fonction de leur contenu visuel. Le sac de mot visuel (Bag-Of-visual-Words, BOW), en raison de sa simplicité et de sa robustesse, devient l'approche la plus populaire. Malheureusement, cette approche ne prend pas en compte de l'information spatiale, ce qui joue un rôle important dans les catégories de modélisation d'image. Récemment, Lazebnik ont introduit la représentation pyramidale spatiale (Spatial Pyramid Representation, SPR) qui a incorporé avec succès l'information spatiale dans le modèle BOW. Néanmoins, ce système de correspondance rigide empêche la SPR de gérer les variations et les transformations d'image. L'objectif principal de cette thèse est d'étudier un modèle de chaîne de correspondance plus souple qui prend l'avantage d'histogrammes de BOW locaux et se rapproche de la correspondance de la chaîne. Notre première contribution est basée sur une représentation en chaîne et une nouvelle distance d'édition (String Matching Distance, SMD) bien adapté pour les chaînes de l'histogramme qui peut calculer efficacement par programmation dynamique. Un noyau d'édition correspondant comprenant à la fois d'une pondération et d'un système pyramidal est également dérivée. La seconde contribution est une version étendue de SMD qui remplace les opérations d'insertion et de suppression par les opérations de fusion entre les symboles successifs, ce qui apporte de la souplesse labours et correspond aux images. Toutes les distances proposées sont évaluées sur plusieurs jeux de données tâche de classification et sont comparés avec plusieurs approches concurrentes
The exponential increasing of the number of images requires efficient ways to classify them based on their visual content. The most successful and popular approach is the Bag of visual Word (BoW) representation due to its simplicity and robustness. Unfortunately, this approach fails to capture the spatial image layout, which plays an important roles in modeling image categories. Recently, Lazebnik et al (2006) introduced the Spatial Pyramid Representation (SPR) which successfully incorporated spatial information into the BoW model. The idea of their approach is to split the image into a pyramidal grid and to represent each grid cell as a BoW. Assuming that images belonging to the same class have similar spatial distributions, it is possible to use a pairwise matching as similarity measurement. However, this rigid matching scheme prevents SPR to cope with image variations and transformations. The main objective of this dissertation is to study a more flexible string matching model. Keeping the idea of local BoW histograms, we introduce a new class of edit distance to compare strings of local histograms. Our first contribution is a string based image representation model and a new edit distance (called SMD for String Matching Distance) well suited for strings composed of symbols which are local BoWs. The new distance benefits from an efficient Dynamic Programming algorithm. A corresponding edit kernel including both a weighting and a pyramidal scheme is also derived. The performance is evaluated on classification tasks and compared to the standard method and several related methods. The new method outperforms other methods thanks to its ability to detect and ignore identical successive regions inside images. Our second contribution is to propose an extended version of SMD replacing insertion and deletion operations by merging operations between successive symbols. In this approach, the number of sub regions ie. the grid divisions may vary according to the visual content. We describe two algorithms to compute this merge-based distance. The first one is a greedy version which is efficient but can produce a non optimal edit script. The other one is an optimal version but it requires a 4th degree polynomial complexity. All the proposed distances are evaluated on several datasets and are shown to outperform comparable existing methods

APA, Harvard, Vancouver, ISO, and other styles

3

Marco-Sola, Santiago. "Efficient approximate string matching techniques for sequence alignment." Doctoral thesis, Universitat Politècnica de Catalunya, 2017. http://hdl.handle.net/10803/460835.

Full text

Abstract:

One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Due to the fact that at the moment it is technically impossible to decode the genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is aligned or mapped to the genome): for each read one identifies in the reference regions that share a large sequence similarity with it, therefore indicating what the read¿s point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between his/her individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods; this is due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly. This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges towards efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework able to assess the quality of HTS mapping results. In order to be able to do so one has to understand the source and nature of mapping conflicts, and explore the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms. They contribute novel insights that can be used in practical applications towards efficient and accurate read mapping. More in detail, first we present methods able to reduce the storage space taken by indexes for genome-scale references, while still providing fast query access in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping, but are nonetheless capable of giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental algorithmic techniques able to combine several approximate string matching algorithms; this leads to efficient and flexible search algorithms allowing the user to reach arbitrary search depths. All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both as to running time and to the quality of the results it produces.
Uno de los avances más importantes de los últimos años en el campo de la biotecnología ha sido el desarrollo de las llamadas técnicas de secuenciación de alto rendimiento (high-throughput sequencing, HTS). Debido a las limitaciones técnicas para secuenciar un genoma, las técnicas de alto rendimiento secuencian individualmente billones de pequeñas partes del genoma provenientes de regiones aleatorias. Posteriormente, estas pequeñas secuencias han de ser localizadas en el genoma de referencia del organismo en cuestión. Este proceso se denomina alineamiento - o mapeado - y consiste en identificar aquellas regiones del genoma de referencia que comparten una alta similaridad con las lecturas producidas por el secuenciador. De esta manera, en cuestión de horas, la secuenciación de alto rendimiento puede secuenciar un individuo y establecer las diferencias de este con el resto de la especie. En última instancia, estas tecnologías han potenciado nuevos protocolos y metodologías de investigación con un profundo impacto en el campo de la genómica, la medicina y la biología en general. La secuenciación alto rendimiento, sin embargo, supone un reto para los procesos tradicionales de análisis de datos. Debido a la elevada cantidad de datos a analizar, se necesitan nuevas y mejoradas técnicas algorítmicas que puedan escalar con el volumen de datos y producir resultados precisos. Esta tesis aborda dicho problema. Las contribuciones que en ella se realizan se enfocan desde una perspectiva metodológica y otra algorítmica que propone el desarrollo de nuevos algoritmos y técnicas que permitan alinear secuencias de manera eficiente, precisa y escalable. Desde el punto de vista metodológico, esta tesis analiza y propone un marco de referencia para evaluar la calidad de los resultados del alineamiento de secuencias. Para ello, se analiza el origen de los conflictos durante la alineación de secuencias y se exploran los límites alcanzables en calidad con las tecnologías de secuenciación de alto rendimiento. Desde el punto de vista algorítmico, en el contexto de la búsqueda aproximada de patrones, esta tesis propone nuevas técnicas algorítmicas y de diseño de índices con el objetivo de mejorar la calidad y el desempeño de las herramientas dedicadas a alinear secuencias. En concreto, esta tesis presenta técnicas de diseño de índices genómicos enfocados a obtener un acceso más eficiente y escalable. También se presentan nuevas técnicas algorítmicas de filtrado con el fin de reducir el tiempo de ejecución necesario para alinear secuencias. Y, por último, se proponen algoritmos incrementales y técnicas híbridas para combinar métodos de alineamiento y mejorar el rendimiento en búsquedas donde el error esperado es alto. Todo ello sin degradar la calidad de los resultados y con garantías formales de precisión. Para concluir, es preciso apuntar que todos los algoritmos y metodologías propuestos en esta tesis están implementados y forman parte del alineador GEM. Este versátil alineador ofrece resultados de alta calidad en entornos de producción siendo varias veces más rápido que otros alineadores. En la actualidad este software se ofrece gratuitamente, tiene una amplia comunidad de usuarios y ha sido citado en numerosas publicaciones científicas.

APA, Harvard, Vancouver, ISO, and other styles

4

Mäkinen, Veli. "Parameterized approximate string matching and local-similarity-based point-pattern matching." Helsinki : University of Helsinki, 2003. http://ethesis.helsinki.fi/julkaisut/mat/tieto/vk/makinen/.

Full text

APA, Harvard, Vancouver, ISO, and other styles

5

Siragusa, Enrico [Verfasser]. "Approximate string matching for high-throughput sequencing / Enrico Siragusa." Berlin : Freie Universität Berlin, 2015. http://d-nb.info/1074404882/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

6

Keng, Leng Hui. "Approximate String Matching With Dynamic Programming and Suffix Trees." UNF Digital Commons, 2006. http://digitalcommons.unf.edu/etd/196.

Full text

Abstract:

The importance and the contribution of string matching algorithms to the modern society cannot be overstated. From basic search algorithms such as spell checking and data querying, to advanced algorithms such as DNA sequencing, trend analysis and signal processing, string matching algorithms form the foundation of many aspects in computing that have been pivotal in technological advancement. In general, string matching algorithms can be divided into the categories of exact string matching and approximate string matching. We study each area and examine some of the well known algorithms. We probe into one of the most intriguing data structure in string algorithms, the suffix tree. The lowest common ancestor extension of the suffix tree is the key to many advanced string matching algorithms. With these tools, we are able to solve string problems that were, until recently, thought intractable by many. Another interesting and relatively new data structure in string algorithms is the suffix array, which has significant breakthroughs in its linear time construction in recent years. Primarily, this thesis focuses on approximate string matching using dynamic programming and hybrid dynamic programming with suffix tree. We study both approaches in detail and see how the merger of exact string matching and approximate string matching algorithms can yield synergistic results in our experiments.

APA, Harvard, Vancouver, ISO, and other styles

7

Smith, David. "Parallel approximate string matching applied to occluded object recognition." PDXScholar, 1987. https://pdxscholar.library.pdx.edu/open_access_etds/3724.

Full text

Abstract:

This thesis develops an algorithm for approximate string matching and applies it to the problem of partially occluded object recognition. The algorithm measures the similarity of differing strings by scanning for matching substrings between strings. The length and number of matching substrings determines the amount of similarity. A classification algorithm is developed using the approximate string matching algorithm for the identification and classification of objects. A previously developed method of shape description is used for object representation.

APA, Harvard, Vancouver, ISO, and other styles

8

Dubois, Simon. "Offline Approximate String Matching forInformation Retrieval : An experiment on technical documentation." Thesis, Tekniska Högskolan, Högskolan i Jönköping, JTH. Forskningsmiljö Informationsteknik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-22566.

Full text

Abstract:

Approximate string matching consists in identifying strings as similar even ifthere is a number of mismatch between them. This technique is one of thesolutions to reduce the exact matching strictness in data comparison. In manycases it is useful to identify stream variation (e.g. audio) or word declension (e.g.prefix, suffix, plural). Approximate string matching can be used to score terms in InformationRetrieval (IR) systems. The benefit is to return results even if query terms doesnot exactly match indexed terms. However, as approximate string matchingalgorithms only consider characters (nor context neither meaning), there is noguarantee that additional matches are relevant matches. This paper presents the effects of some approximate string matchingalgorithms on search results in IR systems. An experimental research design hasbeen conducting to evaluate such effects from two perspectives. First, resultrelevance is analysed with precision and recall. Second, performance is measuredthanks to the execution time required to compute matches. Six approximate string matching algorithms are studied. Levenshtein andDamerau-Levenshtein computes edit distance between two terms. Soundex andMetaphone index terms based on their pronunciation. Jaccard similarity calculatesthe overlap coefficient between two strings. Tests are performed through IR scenarios regarding to different context,information need and search query designed to query on a technicaldocumentation related to software development (man pages from Ubuntu). Apurposive sample is selected to assess document relevance to IR scenarios andcompute IR metrics (precision, recall, F-Measure). Experiments reveal that all tested approximate matching methods increaserecall on average, but, except Metaphone, they also decrease precision. Soundexand Jaccard Similarity are not advised because they fail on too many IR scenarios.Highest recall is obtained by edit distance algorithms that are also the most timeconsuming. Because Levenshtein-Damerau has no significant improvementcompared to Levenshtein but costs much more time, the last one is recommendedfor use with a specialised documentation. Finally some other related recommendations are given to practitioners toimplement IR systems on technical documentation.

APA, Harvard, Vancouver, ISO, and other styles

9

Pockrandt, Christopher Maximilian [Verfasser]. "Approximate String Matching : Improving Data Structures and Algorithms / Christopher Maximilian Pockrandt." Berlin : Freie Universität Berlin, 2019. http://d-nb.info/1183675879/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

10

Richter, Christoph Jan [Verfasser]. "Über die Vermeidung redundanter Betrachtungen beim Approximate String Matching / Christoph Jan Richter." Dortmund : Universitätsbibliothek Technische Universität Dortmund, 2005. http://d-nb.info/1011533499/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

11

Giaquinta, Emanuele. "Advancements in finite-state methods for string matching." Thesis, Università degli Studi di Catania, 2011. http://hdl.handle.net/10761/186.

Full text

Abstract:

This thesis illustrates some advancements in finite-state methods for solving the string-matching problem and some of its variants. Finite-state automata are central building blocks in string matching algorithms. In particular, the ones relevant for the present study are the Knuth-Morris-Pratt and the suffix automaton as well as their generalizations for the multiple-string-matching problem, i.e., the Aho-Corasick automaton and the automaton induced from the DAWG (Directed Acyclic Word Graph) for a set of strings. In this work I present novel encodings, based on the bit-parallelism technique, of the nondeterministic versions of the aforementioned automata and also illustrate two methods to parallelize a recent algorithm based on the nondeterministic suffix automaton. I further discuss the approximate-string-matching problem. I also show a new simple result concerning the pattern-matching-with-swaps problem and present a new distance function that is defined in terms of edit operations which involve strings rather than single characters. I also present an algorithm, based on dynamic programming and on the DAWG, to solve the approximate-string-matching problem under this distance. I finally discuss the compressed-string-matching problem and present novel results that concern searching in texts encoded with Huffman codes and with the Burrows-Wheeler transform.

APA, Harvard, Vancouver, ISO, and other styles

12

Jupin, Joseph. "Temporal Graph Record Linkage and k-Safe Approximate Match." Diss., Temple University Libraries, 2016. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/412419.

Full text

Abstract:

Computer and Information Science
Ph.D.
Since the advent of electronic data processing, organizations have accrued vast amounts of data contained in multiple databases with no reliable global unique identifier. These databases were developed by different departments for different purposes at different times. Organizing and analyzing these data for human services requires linking records from all sources. RL (Record Linkage) is a process that connects records that are related to the identical or a sufficiently similar entity from multiple heterogeneous databases. RL is a data and compute intensive, mission critical process. The process must be efficient enough to process big data and effective enough to provide accurate matches. We have evaluated an RL system that is currently in use by a local health and human services department. We found that they were using the typical approach that was offered by Fellegi and Sunter with tuple-by-tuple processing, using the Soundex as the primary approximate string matching method. The Soundex has been found to be unreliable both as a phonetic and as an approximate string matching method. We found that their data, in many cases, has more than one value per field, suggesting that the data were queried from a 5NF data base. Consider that if a woman has been married 3 times, she may have up to 4 last names on record. This query process produced more than one tuple per database/entity apparently generating a Cartesian product of this data. In many cases, more than a dozen tuples were observed for a single database/entity. This approach is both ineffective and inefficient. An effective RL method should handle this multi-data without redundancy and use edit-distance for approximate string matching. However, due to high computational complexity, edit-distance will not scale well with big data problems. We developed two methodologies for resolving the aforementioned issues: PSH and ALIM. PSH – The Probabilistic Signature Hash is a composite method that increases the speed of Damerau-Levenshtein edit-distance. It combines signature filtering, probabilistic hashing, length filtering and prefix pruning to increase the speed of edit-distance. It is also lossless because it does not lose any true positive matches. ALIM – Aggregate Link and Iterative Match is a graph-based record linkage methodology that uses a multi-graph to store demographic data about people. ALIM performs string matching as records are inserted into the graph. ALIM eliminates data redundancy and stores the relationships between data. We tested PSH for string comparison and found it to be approximately 6,000 times faster than DL. We tested it against the trie-join methods and found that they are up to 6.26 times faster but lose between 10 and 20 percent of true positives. We tested ALIM against a method currently in use by a local health and human services department and found ALIM to produce significantly more matches (even with more restrictive match criteria) and that ALIM ran more than twice as fast. ALIM handles the multi-data problem and PSH allows the use of edit-distance comparison in this RL model. ALIM is more efficient and effective than a currently implemented RL system. This model can also be expanded to perform social network analysis and temporal data modeling. For human services, temporal modeling can reveal how policy changes and treatments affect clients over time and social network analysis can determine the effects of these on whole families by facilitating family linkage.
Temple University--Theses

APA, Harvard, Vancouver, ISO, and other styles

13

Owolabi, Olumide. "Hardware and software approximate string matching and pattern recognition for intelligent knowledge based systems." Thesis, University of Strathclyde, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.280018.

Full text

APA, Harvard, Vancouver, ISO, and other styles

14

Fogla, Prahlad. "Improving the Efficiency and Robustness of Intrusion Detection Systems." Diss., Georgia Institute of Technology, 2007. http://hdl.handle.net/1853/19772.

Full text

Abstract:

With the increase in the complexity of computer systems, existing security measures are not enough to prevent attacks. Intrusion detection systems have become an integral part of computer security to detect attempted intrusions. Intrusion detection systems need to be fast in order to detect intrusions in real time. Furthermore, intrusion detection systems need to be robust against the attacks which are disguised to evade them. We improve the runtime complexity and space requirements of a host-based anomaly detection system that uses q-gram matching. q-gram matching is often used for approximate substring matching problems in a wide range of application areas, including intrusion detection. During the text pre-processing phase, we store all the q-grams present in the text in a tree. We use a tree redundancy pruning algorithm to reduce the size of the tree without losing any information. We also use suffix links for fast linear-time q-gram search during query matching. We compare our work with the Rabin-Karp based hash-table technique, commonly used for multiple q-gram matching. To analyze the robustness of network anomaly detection systems, we develop a new class of polymorphic attacks called polymorphic blending attacks, that can effectively evade payload-based network anomaly IDSs by carefully matching the statistics of the mutated attack instances to the normal profile. Using PAYL anomaly detection system for our case study, we show that these attacks are practically feasible. We develop a formal framework which is used to analyze polymorphic blending attacks for several network anomaly detection systems. We show that generating an optimal polymorphic blending attack is NP-hard for these anomaly detection systems. However, we can generate polymorphic blending attacks using the proposed approximation algorithms. The framework can also be used to improve the robustness of an intrusion detector. We suggest some possible countermeasures one can take to improve the robustness of an intrusion detection system against polymorphic blending attacks.

APA, Harvard, Vancouver, ISO, and other styles

15

Bergman, John. "Efficient fuzzy type-ahead search on big data using a ranked trie data structure." Thesis, Umeå universitet, Institutionen för fysik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-145029.

Full text

Abstract:

The efficiency of modern search engines depends on how well they present typo-corrected results to a user while typing. So-called fuzzy type-ahead search combines fuzzy string matching and search-as-you-type functionality, and creates a powerful tool for exploring indexed data. Current fuzzy type-ahead search algorithms work well on small data sets, but for big data of social networking services such as Facebook, e-commerce sites such as Amazon, or media streaming services such as YouTube, responsive fuzzy type-ahead search remains a great challenge. This thesis describes a method that enables responsive type-ahead search combined with fuzzy string matching on big data by keeping the search time optimal for human interaction at the expense of lower accuracy for less popular records when a query contains typos. This makes the method effective for e-commerce and media services where the popularity of search terms is a result of human behaviour and thus often follow a power-law distribution.
Effektiviteten hos moderna sökmotorer beror på hur väl de presenterar rättstavade resultat för en användare medan en sökning skrivs. Så kallad fuzzy type-ahead sök kombinerar approximativ strängmatchning och sök-medan-du-skriver funktionalitet, vilket skapar ett kraftfullt verktyg för att utforska data. Dagens algoritmer för fuzzy type-ahead sök fungerar väl för små mängder data, men för data i storleksordningen “big data” från t.ex sociala nätverkstjänster så som Facebook, e-handelssidor så som Amazon, eller media tjänster så som YouTube, är en responsiv fuzzy type-ahead sök ännu en stor utmaning. Denna avhandling beskriver en metod som möjliggör responsiv type-ahead sök kombinerat med approximativ strängmatchning för big data genom att hålla söktiden optimal för mänsklig interaktion på bekostnad av lägre precision för mindre populär information när en sök-förfrågan innehåller felstavningar. Detta gör metoden effektiv för e-handel och mediatjänster där populariteten av sök-termer är ett resultat av mänskligt beteende vilket ofta följer en potens-lag distribution.

APA, Harvard, Vancouver, ISO, and other styles

16

Toth, Róbert. "Přibližné vyhledávání řetězců v předzpracovaných dokumentech." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236122.

Full text

Abstract:

This thesis deals with the problem of approximate string matching, also called string matching allowing errors. The thesis targets the area of offline algorithms, which allows very fast pattern matching thanks to index created during initial text preprocessing phase. Initially, we will define the problem itself and demonstrate variety of its applications, followed by short survey of different approaches to cope with this problem. Several existing algorithms based on suffix trees will be explained in detail and new hybrid algorithm will be proposed. Algorithms wil be implemented in C programming language and thoroughly compared in series of experiments with focus on newly presented algorithm.

APA, Harvard, Vancouver, ISO, and other styles

17

Neto, Domingos Soares. "Filtros para a busca e extração de padrões aproximados em cadeias biológicas." Universidade de São Paulo, 2008. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-19102009-002745/.

Full text

Abstract:

Esta dissertação de mestrado aborda formulações computacionais e algoritmos para a busca e extração de padrões em cadeias biológicas. Em particular, o presente texto concentra-se nos dois problemas a seguir, considerando-os sob as distâncias de Hamming e Levenshtein: a) como determinar os locais nos quais um dado padrão ocorre de modo aproximado em uma cadeia fornecida; b) como extrair padrões que ocorram de modo aproximado em um número significativo de cadeias de um conjunto fornecido. O primeiro problema, para o qual já existem diversos algoritmos polinomiais, tem recebido muita atenção desde a década de 60, e ganhou novos ares com o advento da biologia computacional, nos idos dos anos 80, e com a popularização da Internet e seus mecanismos de busca: ambos os fenômenos trouxeram novos obstáculos a serem superados, em razão do grande volume de dados e das bastante justas restrições de tempo inerentes a essas aplicações. O segundo problema, de surgimento um pouco mais recente, é intrinsicamente desafiador, em razão de sua complexidade computacional, do tamanho das entradas tratadas nas aplicações mais comuns e de sua dificuldade de aproximação. Também é de chamar a atenção o seu grande potencial de aplicação. Neste trabalho são apresentadas formulações adequadas dos problemas abordados, assim como algoritmos e estruturas de dados essenciais ao seu estudo. Em especial, estudamos a extremamente versátil árvore dos sufixos, assim como uma de suas generalizações e sua estrutura irmã: o vetor dos sufixos. Grande parte do texto é dedicada aos filtros baseados em q-gramas para a busca aproximada de padrões e algumas de suas mais recentes variações. Estão cobertos os algoritmos bit-paralelos de Myers e Baeza-Yates-Gonnet para a busca de padrões; os algoritmos de Sagot para a extração de padrões; os algoritmos de filtragem de Ukkonen, Jokinen-Ukkonen, Burkhardt-Kärkkäinen, entre outros.
This thesis deals with computational formulations and algorithms for the extraction and search of patterns from biological strings. In particular, the present text focuses on the following problems, both considered under Hamming and Levenshtein distances: 1. How to find the positions where a given pattern approximatelly occurs in a given string; 2. How to extract patterns which approximatelly occurs in a certain number of strings from a given set. The first problem, for which there are many polinomial time algorithms, has been receiving a lot of attention since the 60s and entered a new era of discoveries with the advent of computational biology, in the 80s, and the widespread of the Internet and its search engines: both events brought new challenges to be faced by virtue of the large volume of data usually held by such applications and its time constraints. The second problem, much younger, is very challenging due to its computational complexity, approximation hardness and the size of the input data usually held by the most common applications. This problem is also very interesting due to its potential of application. In this work we show computational formulations, algorithms and data structures for those problems. We cover the bit-parallel algorithms of Myers, Baeza-Yates-Gonnet and the Sagots algorithms for patterns extraction. We also cover here the oustanding versatile suffix tree, its generalised version, and a similar data structure: the suffix array. A significant part of the present work focuses on q-gram based filters designed to solve the approximate pattern search problem. More precisely, we cover the filter algorithms of Ukkonen, Jokinen-Ukkonen and Burkhardt-Kärkkäinen, among others.

APA, Harvard, Vancouver, ISO, and other styles

18

Тодоріко, Ольга Олексіївна. "Моделі та методи очищення та інтеграції текстових даних в інформаційних системах." Thesis, Запорізький національний університет, 2016. http://repository.kpi.kharkov.ua/handle/KhPI-Press/21856.

Full text

Abstract:

Дисертація на здобуття наукового ступеня кандидата технічних наук за спеціальністю 05.13.06 – інформаційні технології. – Національний технічний університет "Харківський політехнічний інститут", Харків, 2016. У дисертаційній роботі вирішена актуальна науково-практична задача підвищення ефективності та якості технології очищення та інтеграції текстових даних в довідкових і пошукових інформаційних системах за рахунок використання моделей словозмінної парадигми та методу побудови лексемного індексу при організації пошуку за схожістю. Розроблено моделі словозмінної парадигми, що включають представлення слів та обчислення приблизної міри схожості між ними. Розроблено метод побудови лексемного індексу, що базується на запропонованих моделях словозмінної парадигми та дозволяє відобразити слово і всі його словоформи в один запис індексу. Удосконалено метод пошуку за схожістю за рахунок покращення етапу попередньої фільтрації завдяки використанню розробленої моделі словозмінної парадигми та лексемного індексу. Виконана експериментальна оцінка ефективності вказує на високу точність та 99 0,5 % повноту. Удосконалено інформаційну технологію очищення та інтеграції даних за рахунок розроблених моделей та методів. Розроблено програмну реалізацію, яка на базі запропонованих моделей та методів виконує пошук за схожістю, очищення та інтеграцію наборів даних. Одержані в роботі теоретичні та практичні результати впроваджено у виробничий процес документообігу приймальної комісії та навчальний процес математичного факультету Державного вищого навчального закладу "Запорізький національний університет".
The thesis for the candidate degree in technical sciences, speciality 05.13.06 – Information Technologies. – National Technical University "Kharkiv Polytechnic Institute", Kharkiv, 2016. In the thesis the actual scientific and practical problem of increasing the efficiency and quality of cleaning and integration of data in information reference system and information retrieval system is solved. The improvement of information technology of cleaning and integration of data is achieved by reduction of quantity of mistakes in text information by means of use of model of an inflectional paradigm, methods of creation of a lexeme index, advanced methods of tolerant retrieval. The developed model of an inflectional paradigm includes a representation of words as an ordered collection of signatures and an approximate measure of similarity between two representations. The model differs in method of dealing with forms of words and character positions. It provides the basis for the implementation of improved methods of tolerant retrieval, cleaning and integration of datasets. The method of creation of the lexeme index which is based on the offered model of an inflectional paradigm is developed, and it allows mapping a word and all its forms to a record of the index. The method of tolerant retrieval is improved at preliminary filtration stage thanks to the developed model of an inflectional paradigm and the lexeme index. The experimental efficiency evaluation indicates high precision and 99  0,5 % recall. The information technology of cleaning and integration of data is improved using the developed models and methods. The software which on the basis of the developed models and methods carries out tolerant retrieval, cleaning and integration of data sets was developed. Theoretical and practical results of the thesis are introduced in production of document flow of an entrance committee and educational process of mathematical faculty of the State institution of higher education "Zaporizhzhya National University".

APA, Harvard, Vancouver, ISO, and other styles

19

Тодоріко, Ольга Олексіївна. "Моделі та методи очищення та інтеграції текстових даних в інформаційних системах." Thesis, НТУ "ХПІ", 2016. http://repository.kpi.kharkov.ua/handle/KhPI-Press/21853.

Full text

Abstract:

Дисертація на здобуття наукового ступеня кандидата технічних наук за спеціальністю 05.13.06 – інформаційні технології. – Національний технічний університет «Харківський політехнічний інститут», Харків, 2016. У дисертаційній роботі вирішена актуальна науково-практична задача підвищення ефективності та якості технології очищення та інтеграції текстових даних в довідкових і пошукових інформаційних системах за рахунок використання моделей словозмінної парадигми та методу побудови лексемного індексу при організації пошуку за схожістю. Розроблено моделі словозмінної парадигми, що включають представлення слів та обчислення приблизної міри схожості між ними. Розроблено метод побудови лексемного індексу, що базується на запропонованих моделях словозмінної парадигми та дозволяє відобразити слово і всі його словоформи в один запис індексу. Удосконалено метод пошуку за схожістю за рахунок покращення етапу попередньої фільтрації завдяки використанню розробленої моделі словозмінної парадигми та лексемного індексу. Виконана експериментальна оцінка ефективності вказує на високу точність та 99 0,5 % повноту. Удосконалено інформаційну технологію очищення та інтеграції даних за рахунок розроблених моделей та методів. Розроблено програмну реалізацію, яка на базі запропонованих моделей та методів виконує пошук за схожістю, очищення та інтеграцію наборів даних. Одержані в роботі теоретичні та практичні результати впроваджено у виробничий процес документообігу приймальної комісії та навчальний процес математичного факультету Державного вищого навчального закладу «Запорізький національний університет».
The thesis for the candidate degree in technical sciences, speciality 05.13.06 – Information Technologies. – National Technical University «Kharkiv Polytechnic Institute», Kharkiv, 2016. In the thesis the actual scientific and practical problem of increasing the efficiency and quality of cleaning and integration of data in information reference system and information retrieval system is solved. The improvement of information technology of cleaning and integration of data is achieved by reduction of quantity of mistakes in text information by means of use of model of an inflectional paradigm, methods of creation of a lexeme index, advanced methods of tolerant retrieval. The developed model of an inflectional paradigm includes a representation of words as an ordered collection of signatures and an approximate measure of similarity between two representations. The model differs in method of dealing with forms of words and character positions. It provides the basis for the implementation of improved methods of tolerant retrieval, cleaning and integration of datasets. The method of creation of the lexeme index which is based on the offered model of an inflectional paradigm is developed, and it allows mapping a word and all its forms to a record of the index. The method of tolerant retrieval is improved at preliminary filtration stage thanks to the developed model of an inflectional paradigm and the lexeme index. The experimental efficiency evaluation indicates high precision and 99  0,5 % recall. The information technology of cleaning and integration of data is improved using the developed models and methods. The software which on the basis of the developed models and methods carries out tolerant retrieval, cleaning and integration of data sets was developed. Theoretical and practical results of the thesis are introduced in production of document flow of an entrance committee and educational process of mathematical faculty of the State institution of higher education «Zaporizhzhya National University».

APA, Harvard, Vancouver, ISO, and other styles

20

Natarajan, Santhi. "Accelerated and Accurate Alignment of Short Reads in High Throughput Next Generation Sequencing [NGS] Platforms." Thesis, 2016. http://etd.iisc.ac.in/handle/2005/4073.

Full text

Abstract:

The genome of an organism encompasses the unique set of genetic instructions for every individual in a species. The genome, in totality, guides the course of evolution, development, genetic and epigenetic growth factors of an individual. Genomics, the study of genome, presents an interdisciplinary landscape, with a multistage data analytics pipeline. Understanding the genome involves determining the order of the four constituent nucleotides or bases or genetic alphabets, namely adenine (A), cytosine (C), guanine (G) and thymine (T), within the genome’s DNA sequence, and the process is widely known as sequencing. Next Generation Sequencing (NGS) involves massively parallel sequencing of genetic data with high throughput. NGS oﬀers an unparalleled interrogation of the genome, throwing deeper insight into the functional and structural investigation of genetic data. The deductions from such a study leave a huge impact across fields, including medical diagnostics, therapeutics and drug discovery, and as well form the basis for genomic medicine. Data processing with NGS happens over an elaborated multi-stage data analytics pipeline. During the primary data analysis, the sequencing process produces billions of short fragments, called short reads, of the target genome. This amounts to petabytes of unprocessed genomic raw data. Short read mapping (SRM) is the process of mapping these short reads to their respective positions in the target genome. Due to the sheer volume of data that needs to be handled, SRM serves as a major sequential bottleneck to the NGS data analytics pipeline in genomics, and presents profound technical and computing challenges. Classified as a complex big data engineering problem, SRM thus calls for innovative computational, scientific and statistical approaches towards big data analysis. A strict validation of various algorithms and softwares in an NGS pipeline is essential, to ensure reliable and accurate results. With growing volume of NGS big data, the SRM and subsequent analytic steps de-mand a High Performance Computing (HPC) environment for data storage and analyses. Existing solutions for accelerating SRM provide notable performance, while leveraging heuristics and incurring significant error rates. Given the impact of the results of SRM in subsequent diagnostics and therapeutics, such heuristics and error rates are not affordable. In this context, we need precise, aﬀordable, reliable and actionable results from SRM, to support any application, with uncompromised accuracy and performance.In this work, we present a massively parallel and scalable archetype, for accurate alignment of short reads, at a fine-grained single nucleotide resolution. The significant contributions of this work are presented below: 1. We present a robust and eﬃcient indexing scheme for the reference genome, which is devoid of heuristics. The scheme reports all possible regions of mapping for a short read, inclusive of repeat regions. The lookup scheme eﬃciently handles the redundancy in reads. Though this leaves the rest of the pipeline with more data for SRM as compared to the heuristic solutions, it provides the end user with reliable and actionable results. 2. We present an eﬃcient parallel implementation of an accurate sequence alignment algorithm based on the Dynamic Programming (DP) method. Our alignment kernels can seamlessly perform the traceback process in hardware simultaneously with the forward scan, thus eliminating the computational and memory bottlenecks associated with such algorithms. These kernels thus report alignment in a minimum deterministic time, which forms the first level of acceleration for SRM. 3. We present AccuRA, a hardware accelerator targeting reconfigurable hardware platforms. The model scales well at multiple levels of granularity, which precisely aligns short reads, at a fine-grained single nucleotide resolution, and oﬀers full coverage of the genome. 4. We present GMAccS, a GPGPU based solution, for the SRM accelerator. This employs a platform independent model, capable of targeting a heterogeneous set of GPU hardware. 5. We present a performance and scalability analysis model for both the archetypes. The results from the prototypes substantiate the scalability of these architectures at multiple levels of granularity. 6. We present the generalization of our solution across applications which exhibit similar data patterns in terms of volume, variety, rate of production and analysis, randomness and uncertainty involved in data, and use Approximate String Matching (ASM) as the fundamental operation for data analytics. 7. We present the various problems within the biological domain, in terms of complexity and quantity of data, to which our solution can be customized and scaled, at various levels of granularity. We have presented the results from various prototype models of both AccuRA and GMAccS. The AccuRA prototype, hosting eight kernel units on a single reconfigurable device, aligns short reads with an alignment performance of 20.48 Giga Cell Updates Per Second (GCUPs). AccuRA can be ported onto devices as diverse as SoCs, ASICs or reconfigurable platform based hardware coprocessors or accelerators. The scalability analysis proved to substantiate the parallel AccuRA architecture, making it a promising target to accelerate the SRM process in the NGS pipeline. The in-house supercomputing platform SahasraT, which is a Cray XC40 system, hosted the prototype for the GMAccS archetype. The GMAccS prototypes align with an optimal performance of 23.69 Million Maps Per Second (MMPS) to 528.69 MMPS, while scaling from a single GPU to 24 GPUs. The performance model for GMAccS, as well as the results from the prototypes, substantiates the scalability of the GMAccS archetype and the subsequent performance enhancement achieved by it. Both AccuRA and GMAccS accommodate the big data of genomics, with uncompromised accuracy, precision and performance, while aligning the smaller archeal, bacterial and fungal genomes, to the much larger mammalian human genomes. These models have successfully handled redundant reads and multiread alignments. The results from AccuRA and GMAccS are available in the Sequence Alignment/Map (SAM) format, making it compatible with the downstream applications in the NGS pipeline. With a basic parameterized SRM model, and the results from its various prototypes for small and large genome benchmarks, we have gained the confidence that this solution can serve the requirements of accurate and scalable alignment of NGS big data. We believe that our model can serve as a reliable candidate in the future of genomics, called the "genomic highway", where data belonging to multiple applications can be streamed in, and can be aligned real time, with minimal memory and storage requirements, on a generalized alignment engine.

APA, Harvard, Vancouver, ISO, and other styles

21

Huang, Shih-Yuan, and 黃詩元. "Approximate String Matching under Non-Overlapping Inversions." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/91360854996893997589.

Full text

Abstract:

碩士
國立清華大學
資訊工程學系
102
In this thesis, we introduce and study the approximate string matching problem under non-overlapping inversion distance. Given a text t, a pattern p and a non-negative integer k, the goal of the problem is to find all locations in the text t that match the pattern p with at most k non-overlapping inversions. First, we use the dynamic programming approach to design an algorithm that solves this problem in O(nm^2 ) time and O(m^2) space, where n is the length of the text and m is the length of the pattern. Next, we present another algorithm based on an efficient filtering strategy that has the same worst-case time and space complexities as the first algorithm.

APA, Harvard, Vancouver, ISO, and other styles

22

Lu, Chia Wei, and 呂嘉維. "Efficient Exact and Approximate String Matching Algorithms." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/49922113552090356808.

Full text

Abstract:

博士
國立清華大學
資訊工程學系
102
In this thesis, we first propose two algorithms for exact string matching problem, which aims to find all the positions i's in a given text where a given pattern occurs. Our algorithms find an optimal selective comparing order of the characters of the pattern so that we could have a better performance in the searching phase. To find the optimal comparing order, we adopt the branch and bound approach. Moreover, our proposed algorithm can be combined with other existing exact string matching algorithms to improve the searching efficiency. The experimental results show that our algorithms indeed have the smallest number of character comparisons and are also efficient in time as compared with other existing exact string matching algorithms. Second, we propose a new filtration algorithm, as well as a hybrid filtration strategy, to efficiently solve the approximate string matching problem (also called the k-difference problem), which aims to find all the positions i's in a given text such that there exists a substring of the text ending at position i whose edit distance from a given pattern is less than or equal to a given error bound k. Our experimental results on simulated datasets of DNA sequences show that when compared with other filtration algorithms, our filtration algorithm has better performance on the efficiency to filter out those positions of the text at which the pattern does not occur approximately. Moreover, our hybrid filtration strategy further improves the effectiveness of our filtration algorithm. Third, we propose a progressive approach to solve the DNA resequencing problem which is defined as follows: We are given an unknown DNA sequence X and a known reference sequence R. Our task is to see whether X and R are similar or not. The present popular approach is to break up X into subsequences by the next generation sequencing (NGS) technologies, called reads. We then map the reads of X onto R with a suitable error bound. However, if the similarity between X and R is not very high (<95%), there would be many reads unmapped, and we then cannot obtain the mutations inside the unmapped regions. One can use a large error bound to increase the number of reads mapped. But it is not a good solution because increasing error bound will also increase the probability of false positive mapping. Our approach uses a small error bound and to increase the number of reads mapped, our approach modifies R each time after the reads are mapped. Thus our approach is a progressive approach. Compared with other available tools, our approach allows us to be able to map more reads to the reference sequence. In our simulated experiments, we also show the high correctness of our mapping algorithm.

APA, Harvard, Vancouver, ISO, and other styles

23

Pan, Z. H., and 潘子豪. "An Approximate String Matching Based on an Exact String Matching with Constant Pattern Length." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/45738203365580501083.

Full text

Abstract:

碩士
國立暨南國際大學
資訊工程學系
96
The string matching problem is a very important problem which is needed to solve in computer science, computational biology and other domains. The exact matching problem is defined as follows: Given a string T whose length is n and a string P whose length is m, n≧m, find each location where string P occurs in string T. The approximate string matching problem is defined as follows: Given a pattern string P of length and a text string T of length n, n≧m, and a maximal number k of errors allowed, find each location i of T such that there exists a suffix T of T(0,i) and ED(A,B)≦k. In our thesis, we will use the exact string matching to solve the approximate string matching problem.

APA, Harvard, Vancouver, ISO, and other styles

24

Pan, Zu-Hao. "An Approximate String Matching Based on an Exact String Matching with Constant Pattern Length." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-2406200816061900.

Full text

APA, Harvard, Vancouver, ISO, and other styles

25

Wang, Hui Min, and 王惠民. "A Pre-processing Procedure for Approximate String Matching." Thesis, 1995. http://ndltd.ncl.edu.tw/handle/93583400034298936718.

Full text

Abstract:

碩士
中原大學
應用數學研究所
83
Given two strings P=p1p2...pm and T=t1t2...tn over an alphabet Omega whose size is s, the edit distance d(P,T) between P and T is the minimum number of editing operations needed to convert P to T. An editing operation is either an insertion, a deletion or a change of a character. The k-differences problem is to find all substrings T'' of T such that d(P,T'')<= k, where k is a positive integer. Many researchers work on this problem. The main purpose of this paper is not proposing a new algorithm but instead of a pre-processing procedure. This procedure can be easily applied to various algorithms. The basic idea is to find the impossible occurrence by counting the number of each symbol. When the approximate strings happen rarely, this procedure performs a good result.

APA, Harvard, Vancouver, ISO, and other styles

26

Burkhardt, Stefan [Verfasser]. "Filter algorithms for approximate string matching / von Stefan Burkhardt." 2004. http://d-nb.info/972323015/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

27

LIU, YAN-SHAO, and 劉彥劭. "Query by Humming System Based-on Approximate String Matching Technique." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/96901427832257868600.

Full text

Abstract:

碩士
國立臺灣大學
資訊工程學研究所
90
With the growth of digital music released on the Internet, the requirement of the music information retrieval system becomes more and more important. In this thesis, we proposed and realized a query-by-humming system based on string matching technique. Users can find songs they want by humming about ten seconds notes of the song. In this system, we adopt autocorrelation pitch tracking, amplitude based note segmentation, and key adjusting by key finding algorithm, to get the pitch interval and duration ratio representation from the user’s acoustic input. And different melody representations and similarity comparison methods are performed and compared. Besides, we try to use Genetic Algorithm to optimize the system parameter. Through the training process, we can find the most suitable parameter for the humming habits of the user, and it can help us to improve the query accuracy of our system.

APA, Harvard, Vancouver, ISO, and other styles

28

Lu, Chia-Wei, and 呂嘉維. "String Matching Algorithms Based upon the Uniqueness Property and an Approximate String Matching Algorithm Based upon the Candidate Elimination Method." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/17632385982720975496.

Full text

Abstract:

碩士
國立暨南國際大學
資訊工程學系
96
In this thesis, we consider two problems, exact string matching and approximate string matching problem. The exact string matching problem is to determine all of the locations of a pattern string P appearing in a text string T. We first point out a uniqueness property of a given pattern. We then propose four algorithms to solve the exact string matching problem based upon the uniqueness property. Experimental results showed that three of our algorithms are faster than the KMP and Boyer-Moore algorithms. The approximate string matching problem is defined as follows: Given a text string T, a pattern string P and an error bound k, find all substrings of T whose edit distances with P are smaller than or equal to k. We give a method to eliminate candidate locations in text T as there can be no substring S starting from those locations such that the edit distance between S and pattern P is smaller than or equal to k. Our method is simple to implement. Experimental results show that our method is effective, especially in the case when we perform natural language searching.

APA, Harvard, Vancouver, ISO, and other styles

29

Lu, Chia-Wei. "String Matching Algorithms Based upon the Uniqueness Property and an Approximate String Matching Algorithm Based upon the Candidate Elimination Method." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-2406200815183100.

Full text

APA, Harvard, Vancouver, ISO, and other styles

30

Shieh, Y. K., and 謝一功. "An Improved Approximate String Matching Algorithm Based upon the Boyer-Moore Algorithm." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/66121462520038468689.

Full text

Abstract:

碩士
國立暨南國際大學
資訊工程學系
96
In this thesis, we discuss the exact string matching problem and approximate string matching problem. We avoid using a brute-force algorithm to solve the string matching problem. We improve the Bad Character Rule of Boyer and Moore Algorithm and compare it with the Horspool Algorithm. And we find that the improved method include the information of the Horspool Algorithm. Therefore, we combine the improved method with the Horspool Algorithm. The combinative method has a better efficiency in searching phase.

APA, Harvard, Vancouver, ISO, and other styles

31

Shieh, Yi-Kung. "An Improved Approximate String Matching Algorithm Based upon the Boyer-Moore Algorithm." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-2406200816121600.

Full text

APA, Harvard, Vancouver, ISO, and other styles

32

Huang, Chun-cheng, and 黃俊程. "High-Performance Parallel Location-Aware Algorithms for Approximate String Matching on GPUs." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/5b596z.

Full text

Abstract:

碩士
國立臺灣師範大學
電機工程學系
105
Approximate string matching has been widely used in many applications, including deoxyribonucleic acid sequence searching, spell checking, text mining, and spam filters. The method is designed to find all locations of strings that approximately match a pattern in accordance with the number of insertion, deletion, and substitution operations. Among the proposed algorithms, the bit-parallel algorithms are considered to be the best and highly efficient algorithms. However, the traditional bit-parallel algorithms lacks the ability of identifying the start and end positions of a matched pattern. Furthermore, acceleration of the bit-parallel algorithms has become a crucial issue for processing big data nowadays. In this paper, we propose two kinds of parallel location-aware algorithms called data-segmented parallelism and high-degree parallelism as means to accelerate approximate string matching using graphic processing units. Experimental results show that the high-degree parallelism on GPUs achieves significant improvement in system and kernel throughputs compared to CPU counterparts. Compared to state-of-the-art approaches, the proposed high-degree parallelism achieves 11 to 105 times improvement. Finally, we develop a web service which allows users to perform approximate string matching on line and deliver all matched substrings with the start and end positions.

APA, Harvard, Vancouver, ISO, and other styles

33

Lin, Sheng-Yuan, and 林聖淵. "A High Performance Parallel Algorithm for Approximate String Matching on Multi-core Processor." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/96412144907123635479.

Full text

APA, Harvard, Vancouver, ISO, and other styles

34

Dobiášovský, Jan. "Přibližná shoda znakových řetězců a její aplikace na ztotožňování metadat vědeckých publikací." Master's thesis, 2020. http://www.nusl.cz/ntk/nusl-415121.

Full text

Abstract:

The thesis explores the application of approximate string matching in scientific publication record linkage process. An introduction to record matching along with five commonly used metrics for string distance (Levenshtein, Jaro, Jaro-Winkler, Cosine distances and Jaccard coefficient) are provided. These metrics are applied on publication metadata from V3S current research information system of the Czech Technical University in Prague. Based on the findings, optimal thresholds in the F1, F2 and F3-measures are determined for each metric.

APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic 'Approximate String Matching (ASM)'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles