Dissertations / Theses on the topic 'Sequenze biologiche'

Consult the top 50 dissertations / theses for your research on the topic 'Sequenze biologiche.'

1

Zappala', Domenica. "Espressione di diverse sequenze geniche del Polyomavirus JC nel soggetto immunocompromesso." Doctoral thesis, Università di Catania, 2012. http://hdl.handle.net/10761/1091.

Abstract:
It is well known that the immune system is the basis of the body's protection against infection, so any immune deficit facilitates the onset of infectious diseases and makes them more severe. The immunodepressed state in which some subjects find themselves as a result of various pathological events becomes the precondition for the reactivation of viral pathogens already present in the body in latent form. JCV is a ubiquitous polyomavirus that infects humans in childhood and, after primary infection, persists latently in the host, with episodes of active replication alternating with states of quiescence depending on the reactive capacity of the infected subject. Despite the appearance of antibodies, the virus is not cleared from the body but remains latent in the kidney, bone marrow, cells of the nervous tissue, lymph nodes, and intestinal epithelium, making the host a healthy carrier until a possible reactivation. Under conditions of severe immunosuppression the virus may reactivate and induce a fatal demyelinating disease known as progressive multifocal leukoencephalopathy (PML). The mechanism of reactivation appears to be closely linked to replication processes and to the expression of particular gene sequences. This study therefore evaluated the presence of JCV DNA in terms of gene expression of two different regions of the virus, the early Large-T region (early) and the late mRNA intron of mVP1/mVP2 (late), in five distinct groups of immunodepressed subjects. A total of 200 plasma samples were analyzed from patients admitted to the hematology, transplant, and gastroenterology units of several hospitals in Catania.
In addition, 55 fresh and paraffin-embedded biopsy samples were included, taken from a corresponding number of patients: patients with ulcerative colitis (intestinal mucosa), subjects with precancerous lesions or colon cancer (neoplastic tissue), and kidney transplant recipients (transplanted kidney). For this retrospective study, nested PCR and real-time PCR methods were applied both to confirm the presence of JCV viral DNA and to evaluate the expression of the two gene sequences of interest. Of all plasma samples analyzed, only 26% (52/200) were negative; among the others, 38.5% (77/200) were positive for the late region only and approximately 25% (51/200) for the early region only. Both regions were detected together in 10% of cases (20/200). Given the prevalence of positivity for the late region, it appears that, although the early region is a diagnostic reference sequence, the late sequence plays a fundamental role in the diagnosis of this type of subject, a finding further supported by a statistically significant association between the two regions (p<0.05). In the biopsy samples, positivity was found only for the VP1/VP2 sequences; this may depend on the replication mechanisms that arise following latency or reactivation of the virus in cells that are non-permissive for it, such as intestinal epithelial cells. Because the expression of Large-T and VP1/VP2 is closely tied to completion of the viral cycle, their detection depends on the different phases of replication. The new diagnostic target, compared with and used alongside the traditional one, could therefore clarify the evolution of the diseases associated with this virus.
Testing for two different regions of the virus could help clarify the diagnosis and, where therapy is under way, allow proper monitoring and optimization of the therapeutic intervention.
2

Fortino, Vittorio. "Sequence analysis in bioinformatics: methodological and practical aspects." Doctoral thesis, Universita degli studi di Salerno, 2013. http://hdl.handle.net/10556/985.

Abstract:
2011 - 2012
My PhD research has focused on the development of new computational methods for biological sequence analysis. To address a problem intrinsic to protein sequence analysis, namely inferring homologies in large protein databases from short queries, I developed a BLAST-based statistical framework to detect distant homologies conserved in transmembrane domains of different bacterial membrane proteins. Using this framework, the transmembrane protein domains of all Salmonella spp. were screened and more than five thousand significant homologies were identified. My results show that the proposed framework detects distant homologies that, because of their conservation in distinct bacterial membrane proteins, could represent ancient signatures of primeval genetic elements (or mini-genes) coding for short polypeptides that formed, through a primitive assembly process, more complex genes. Further, my statistical framework lays the foundation for new bioinformatics tools for domain-oriented homology detection, in other words, for finding statistically significant homologies in specific target domains. The second problem that I addressed concerns the analysis of transcripts obtained from RNA-Seq data. I developed a novel computational method that combines transcript borders, obtained from mapped RNA-Seq reads, with sequence-feature-based operon predictions to accurately infer operons in prokaryotic genomes. Since the transcriptome of an organism is dynamic and condition dependent, the mapped RNA-Seq reads are used to determine a set of confirmed or predicted operons; from this set, specific transcriptomic features are extracted and combined with standard genomic features to train and validate three operon classification models (Random Forests, RFs; Neural Networks, NNs; and Support Vector Machines, SVMs).
These classifiers were used to refine the operon map annotated by DOOR, one of the most widely used databases of prokaryotic operons. The method showed that integrating genomic and transcriptomic features improves the accuracy of operon predictions, and that it is possible to predict the existence of potential new operons. An inherent limitation of using RNA-Seq to improve operon structure predictions is that it cannot be applied to genes that are not expressed under the condition studied. I evaluated my approach on different RNA-Seq-based transcriptome profiles of Histophilus somni and Porphyromonas gingivalis. These transcriptome profiles were obtained using either the standard or the strand-specific RNA-Seq method. My experimental results demonstrate that the three classifiers achieved accurate operon maps, including reliable predictions of new operons. [edited by author]
XI n.s.
3

Seth, Pawan. "STUDY OF THE RELATIONSHIP BETWEEN Mus musculus PROTEIN SEQUENCES AND THEIR BIOLOGICAL FUNCTIONS." University of Akron / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=akron1176736255.

4

Arvestad, Lars. "Algorithms for biological sequence alignment." Doctoral thesis, KTH, Numerisk analys och datalogi, NADA, 1999. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-2905.

5

Altschul, Stephen Frank. "Aspects of biological sequence comparison." Thesis, Massachusetts Institute of Technology, 1987. http://hdl.handle.net/1721.1/102708.

Abstract:
Thesis (Ph. D)--Massachusetts Institute of Technology, Dept. of Mathematics, 1987.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Bibliography: leaves 165-168.
by Stephen Frank Altschul.
Ph.D
6

Yeats, Corin Anthony. "Biological investigations through sequence analysis." Thesis, University of Cambridge, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.614848.

7

Pustułka-Hunt, Elżbieta Katarzyna. "Biological sequence indexing using persistent Java." Thesis, University of Glasgow, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.270957.

8

Xu, Keyuan. "Stochastic modeling of biological sequence evolution." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/32113.

Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (leaves 81-86).
Markov models of sequence evolution are a fundamental building block for making inferences in biological research. This thesis reviews several major techniques developed to estimate parameters of Markov models of sequence evolution and presents a new approach for evaluating and comparing estimation techniques. Current methods for evaluating estimation techniques require sequence data from populations with well-known phylogenetic relationships. Such data is not always available since phylogenetic relationships can never be known with certainty. We propose generating sequence data for the purpose of estimation technique evaluation by simulating sequence evolution in a controlled setting. Our elementary simulator uses a Markov model and a binary branching process, which dynamically builds a phylogenetic tree from an initial seed sequence. The sequences at the leaves of the tree can then be used as input to estimation techniques. We demonstrate our evaluation approach on Arvestad and Bruno's estimation method, and show how our approach can reveal performance variations empirically. The results of our simulation can be used as a guide towards improving estimation techniques.
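The simulation idea in this abstract (evolve a seed sequence down a branching tree under a Markov model, then hand the leaf sequences to an estimator) can be illustrated with a minimal sketch. It assumes a Jukes-Cantor-style substitution step and a fully balanced binary tree, neither of which is stated in the abstract; the function names `mutate` and `simulate_leaves` and the parameter `p_sub` are hypothetical.

```python
import random

BASES = "ACGT"

def mutate(seq, p_sub, rng):
    """One Markov step per branch: each site independently changes to a
    different base with probability p_sub (a Jukes-Cantor-like assumption)."""
    out = []
    for base in seq:
        if rng.random() < p_sub:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_leaves(seed_seq, depth, p_sub, rng):
    """Grow a balanced binary tree from seed_seq and return the sequences
    at its 2**depth leaves. Every node gives rise to two mutated children."""
    level = [seed_seq]
    for _ in range(depth):
        next_level = []
        for parent in level:
            next_level.append(mutate(parent, p_sub, rng))
            next_level.append(mutate(parent, p_sub, rng))
        level = next_level
    return level

rng = random.Random(0)
leaves = simulate_leaves("ACGTACGTAC", depth=3, p_sub=0.05, rng=rng)
```

Because the true tree and substitution probability are known by construction, the leaves can serve as ground-truth input when comparing parameter estimation methods.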
by Keyuan Xu.
M.Eng.
9

Murrel, Benjamin. "Improved models of biological sequence evolution." Thesis, Stellenbosch : Stellenbosch University, 2012. http://hdl.handle.net/10019.1/71870.

Abstract:
Thesis (PhD)--Stellenbosch University, 2012.
Computational molecular evolution is a field that attempts to characterize how genetic sequences evolve over phylogenetic trees, the branching processes that describe the patterns of genetic inheritance in living organisms. It has a long history of developing progressively more sophisticated stochastic models of evolution. Through a probabilist's lens, this can be seen as a search for more appropriate ways to parameterize discrete-state continuous-time Markov chains to better encode biological reality, matching the historical processes that created empirical data sets, and creating useful tools that allow biologists to test specific hypotheses about the evolution of the organisms or the genes that interest them. This dissertation is an attempt to fill some of the gaps that persist in the literature, solving what we see as existing open problems. The overarching theme of this work is how to better model variation in the action of natural selection at multiple levels: across genes, between sites, and over time. Through four published journal articles and a fifth in preparation, we present amino acid and codon models that improve upon existing approaches, providing better descriptions of the process of natural selection and better tools to detect adaptive evolution.
10

Gîrdea, Marta. "New methods for biological sequence alignment." Thesis, Lille 1, 2010. http://www.theses.fr/2010LIL10089/document.

Abstract:
Biological sequence alignment is a fundamental technique in bioinformatics. It consists of identifying series of similar (conserved) characters that appear in the same order in both sequences, and inferring a set of modifications (substitutions, insertions and deletions) involved in the transformation of one sequence into the other. This technique allows one to infer, based on sequence similarity, whether two or more biological sequences are potentially homologous, i.e. whether they share a common ancestor, thus helping to understand sequence evolution. This thesis addresses sequence comparison problems in two different contexts: homology detection and high-throughput DNA sequencing. The goal of this work is to develop sensitive alignment methods that provide solutions to the following two problems: i) the detection of hidden protein homologies by protein sequence comparison, when the source of the divergence is frameshift mutations, and ii) mapping short SOLiD reads (sequences of overlapping di-nucleotides encoded as colors) to a reference genome. In both cases, the same general idea is applied: to implicitly compare DNA sequences in order to detect changes occurring at this level, while manipulating, in practice, other representations (protein sequences, sequences of di-nucleotide codes) that provide additional information and thus help to improve the similarity search. The aim is to design and implement exact and heuristic alignment methods, along with scoring schemes, adapted to these scenarios.
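To make "overlapping di-nucleotides encoded as colors" concrete, here is a minimal sketch of the commonly described two-bit XOR scheme for SOLiD color space. The abstract does not spell out the encoding, so the table (A=0, C=1, G=2, T=3, color = XOR of adjacent bases) is stated here as an assumption, and the function names are hypothetical.

```python
# Assumed two-bit base codes; the color of a di-nucleotide is the XOR
# of the codes of its two bases, giving four colors 0..3.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def to_colors(dna):
    """Encode each overlapping pair of adjacent bases as one color."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(dna, dna[1:])]

def from_colors(first_base, colors):
    """Decode a color sequence, given the first base as an anchor."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # XOR with the color yields the next base code
        bases.append(CODE_BASE[code])
    return "".join(bases)

reference = to_colors("ATGGC")
variant = to_colors("ATCGC")  # one substitution, G to C, at index 2
changed = [i for i, (x, y) in enumerate(zip(reference, variant)) if x != y]
```

In this sketch a single interior base substitution changes two adjacent colors, whereas an isolated color change does not correspond to one substitution, which is one reason read mapping in color space needs its own alignment and scoring schemes.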
11

Orobitg, Cortada Miquel. "High performance computing on biological sequence alignment." Doctoral thesis, Universitat de Lleida, 2013. http://hdl.handle.net/10803/110930.

Abstract:
Multiple Sequence Alignment (MSA) is a powerful tool for important biological applications. MSAs are computationally difficult to calculate, and most formulations of the problem lead to NP-hard optimization problems. To perform large-scale alignments, with thousands of sequences, new challenges need to be resolved to adapt MSA algorithms to the era of high-performance computing. In this thesis we propose three different contributions that address some limitations of the main MSA methods. The first proposal is a new guide tree construction algorithm that improves the degree of parallelism in order to resolve the bottleneck of the progressive alignment stage. The second proposal optimizes the consistency library, improving the execution time and scalability of MSA so that the method can handle more sequences. Finally, we propose Multiple Trees Alignment (MTA), an MSA method that aligns multiple guide trees in parallel, evaluates the alignments obtained, and selects the best one as the result. The experimental results demonstrated that MTA considerably improves the quality of the alignments.
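The "evaluate the alignments obtained and select the best one" step can be sketched as follows. The abstract does not say which quality measure MTA uses, so a plain sum-of-pairs score stands in here as an assumption; the scoring values and the names `sum_of_pairs` and `select_best` are hypothetical.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment (equal-length gapped strings) column by column,
    summing a pairwise score over every pair of rows."""
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap against gap contributes nothing
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score

def select_best(candidates):
    """Return the candidate alignment with the highest score."""
    return max(candidates, key=sum_of_pairs)

# Two candidate alignments of the same three sequences, as might be
# produced from two different guide trees.
candidate_a = ["ACGT", "ACGT", "AC-T"]
candidate_b = ["ACGT", "ACGT", "A-CT"]
best = select_best([candidate_a, candidate_b])
```

Since each candidate is scored independently, the evaluation stage parallelizes naturally across guide trees.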
12

Thompson, James. "Genetic algorithms applied to biological sequence analysis /." Link to online version, 2006. https://ritdml.rit.edu/dspace/handle/1850/2269.

13

Lyall, Andrew. "Biological sequence comparison on a parallel computer." Thesis, University of Edinburgh, 1988. http://hdl.handle.net/1842/12493.

14

Stanescu, Ana. "Semi-supervised learning for biological sequence classification." Diss., Kansas State University, 2015. http://hdl.handle.net/2097/35810.

Abstract:
Doctor of Philosophy
Department of Computing and Information Sciences
Doina Caragea
Successful advances in biochemical technologies have led to inexpensive, time-efficient production of massive volumes of data, DNA and protein sequences. As a result, numerous computational methods for genome annotation have emerged, including machine learning and statistical analysis approaches that practically and efficiently analyze and interpret data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data in order to build quality classifiers. The process of labeling data can be expensive and time consuming, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on semi-supervised learning approaches for biological sequence classification. Although an attractive concept, semi-supervised learning does not invariably work as intended. Since the assumptions made by learning algorithms cannot be easily verified without considerable domain knowledge or data exploration, semi-supervised learning is not always "safe" to use. Advantageous utilization of the unlabeled data is problem dependent, and more research is needed to identify algorithms that can be used to increase the effectiveness of semi-supervised learning, in general, and for bioinformatics problems, in particular. At a high level, we aim to identify semi-supervised algorithms and data representations that can be used to learn effective classifiers for genome annotation tasks such as cassette exon identification, splice site identification, and protein localization. In addition, one specific challenge that we address is the "data imbalance" problem, which is prevalent in many domains, including bioinformatics. 
The data imbalance phenomenon arises when one of the classes to be predicted is underrepresented in the data because instances belonging to that class are rare (noteworthy cases) or difficult to obtain. Ironically, minority classes are typically the most important to learn, because they may be associated with special cases, as in splice site prediction. We propose two main techniques to deal with the data imbalance problem: one based on "dynamic balancing" (augmenting the originally labeled data only with positive instances during the semi-supervised iterations of the algorithms) and another based on ensemble approaches. The results show that with limited amounts of labeled data, semi-supervised approaches can successfully leverage the unlabeled data, thereby surpassing their completely supervised counterparts. A type of semi-supervised learning known as "transductive" learning aims to classify the unlabeled data without generalizing to new, previously unseen instances. Theoretically, this aspect makes transductive learning particularly suitable for the task of genome annotation, in which an entirely sequenced genome is typically available, sometimes accompanied by limited annotation. We study and evaluate various transductive approaches (such as transductive support vector machines and graph-based approaches) and sequence representations for the problem of cassette exon identification. The results obtained demonstrate the effectiveness of transductive algorithms in sequence annotation tasks.
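The "dynamic balancing" idea, in which self-training adds only confidently predicted positive instances to the labeled set, can be sketched as below. The abstract does not name the base classifier or the confidence measure, so a nearest-centroid rule on numeric feature vectors stands in as an assumption; all names and parameters are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train_positive_only(pos, neg, unlabeled, rounds=3, per_round=2):
    """Self-training that augments only the minority (positive) class.
    Each round, the unlabeled points predicted positive with the largest
    margin are moved into the positive set; negatives are never augmented."""
    pos, unlabeled = list(pos), list(unlabeled)
    for _ in range(rounds):
        c_pos, c_neg = centroid(pos), centroid(neg)
        # margin > 0 means the point is closer to the positive centroid
        scored = [(sq_dist(u, c_neg) - sq_dist(u, c_pos), u) for u in unlabeled]
        confident = sorted((s for s in scored if s[0] > 0), reverse=True)
        confident = confident[:per_round]
        if not confident:
            break
        for _, u in confident:
            pos.append(u)
            unlabeled.remove(u)
    return pos

labeled_pos = [[5.0, 5.0]]  # scarce minority class
labeled_neg = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
pool = [[4.0, 5.0], [5.0, 4.0], [0.5, 0.5], [4.5, 4.5]]
augmented_pos = self_train_positive_only(labeled_pos, labeled_neg, pool)
```

Adding only positives lets the minority class grow toward balance with the majority class over the iterations instead of being swamped by easily found negatives.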
15

Herndon, Nic. "Domain adaptation algorithms for biological sequence classification." Diss., Kansas State University, 2016. http://hdl.handle.net/2097/35242.

Abstract:
Doctor of Philosophy
Department of Computing and Information Sciences
Doina Caragea
The large volume of data generated in recent years has created opportunities for discoveries in various fields. In biology, next-generation sequencing technologies determine the exact order of nucleotides within a DNA or RNA fragment faster and more cheaply than before. This large volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data but require some of these data to be manually labeled by human experts, a process that is costly and time consuming. An alternative to labeling data is to use existing labeled data from a related domain, the source domain, if any such data is available, to train a classifier for the domain of interest, the target domain. However, classification accuracy on the target domain usually decreases as the distance between the source and target domains increases. Another alternative is to label some data, complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. In this work a further alternative is considered, domain adaptation, in which the goal is to train an accurate classifier for a domain with limited labeled data and abundant unlabeled data, the target domain, by leveraging labeled data from a related domain, the source domain. Several domain adaptation classifiers are proposed, derived from a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: the features, the data used from the source domain, how to incorporate the unlabeled data, and how to combine all available data. The proposed approaches were evaluated on two biological problems: protein localization and ab initio splice site prediction.
The former is motivated by the fact that predicting where a protein is localized provides an indication for its function, whereas the latter is an essential step in gene prediction.
16

Blanco, García Enrique. "Meta-alignment of biological sequences." Doctoral thesis, Universitat Politècnica de Catalunya, 2006. http://hdl.handle.net/10803/6654.

Abstract:
Les seqüències són una de les estructures de dades més versàtils que existeixen. De forma relativament senzilla, en una seqüència de símbols es pot emmagatzemar informació de qualsevol tipus. L'anàlisi sistemàtic de seqüències es un àrea molt rica de l'algorísmica amb numeroses aproximacions desenvolupades amb éxit. En concret, la comparació de seqüències mitjançant l'alineament d'aquestes és una de les eines més potents. Una de les aproximacions més populars i eficients per alinear dues seqüències es l'ús de la programació dinàmica. Malgrat la seva evident utilitat, un alineament de dues seqüències no és sempre la millor opció per a caracteritzar la seva funció. Moltes vegades, les seqüències codifiquen la informació en diferents nivells (meta-informació).
És llavors quan la comparació directa entre dues seqüències no es capaç de revelar aquelles estructures d'ordre superior que podrien explicar la relació establerta entre aquestes seqüències.

Amb aquest treball hem contribuït a millorar la forma en que dues seqüències poden ser comparades, desenvolupant una família d'algorismes d'alineament de la informació d'alt nivell codificada en seqüències biològiques (meta-alineaments). Inicialment, hem redissenyat un antic algorisme, basat en programació dinàmica, que és capaç d'alinear dues seqüències de meta-informació, procedint després a introduir-hi vàries millores per accelerar la seva velocitat. A continuació hem desenvolupat un algorisme de meta-aliniament capaç d'alinear un número múltiple de seqüències, combinant l'algorisme general amb un esquema de clustering jeràrquic. A més, hem estudiat les propietats dels meta-alineaments produïts, modificant l'algorisme per tal d'identificar alineaments amb una configuració no necessàriament col.lineal, el que permet llavors la detecció de permutacions en els resultats.

La vida molecular és un exemple paradigmátic de la versatilitat de les seqüències. Les comparaciones entre genomes, ara que la seva seqüència està disponible, permeten identificar numerosos elements biològicament funcionals. La seqüència de nucleòtids de molts gens, per exemple, es troba acceptablement conservada entre diferents espècies. En canvi, les seqüències que regulen la activació dels propis gens són més curtes i variables. Així l'activació simultànea d'un conjunt de gens es pot explicar només a partir de la conservació de configuracions comunes d'elements reguladors d'alt nivell i no pas a partir de la simple conservació de les seves seqüències. Per tant, hem entrenat els nostres programes de meta-alineament en una sèrie de conjunts de regions reguladores recopilades per nosaltres mateixos de la literatura i desprès, hem provat la utilitat biològica de la nostra aproximació, caracteritzant automàticament de forma exitosa les regions activadores de gens humans conservats en altres espècies.
The sequences are very versatile data structures. In a straightforward manner, a sequence of symbols can store any type of information. Systematic analysis of sequences is a very rich area of algorithmics, with lots of successful applications. The comparison by sequence alignment is a very powerful analysis tool. Dynamic programming is one of the most popular and efficient approaches to align two sequences. However, despite their utility, alignments are not always the best option for characterizing the function of two sequences. Sequences often encode information in different levels of organization (meta-information). In these cases, direct sequence comparison is not able to unveil those higher-order structures that can actually explain the relationship between the sequences.

We have contributed with the work presented here to improve the way in which two sequences can be compared, developing a new family of algorithms that align high level information encoded in biological sequences (meta-alignment). Initially, we have redesigned an existent algorithm, based in dynamic programming, to align two sequences of meta-information, introducing later several improvements for a better performance. Next, we have developed a multiple meta-alignment algorithm, by combining the general algorithm with the progressive schema. In addition, we have studied the properties of the resulting meta-alignments, modifying the algorithm to identify non-collinear or permuted configurations.

Molecular life is a great example of the versatility of sequences. Comparative genomics enables the identification of numerous biologically functional elements. The nucleotide sequence of many genes, for example, is relatively well conserved between different species. In contrast, the sequences that regulate gene expression are shorter and more weakly conserved. Thus, the simultaneous activation of a set of genes can be explained only in terms of conservation between configurations of higher-order regulatory elements, which cannot be detected at the sequence level. We have therefore trained our meta-alignment programs on several datasets of regulatory regions collected from the literature. We then tested the accuracy of our approach by successfully characterizing the promoter regions of human genes and their orthologs in other species.
17

Sandve, Geir Kjetil. "Motif discovery in biological sequences." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2005. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9270.

Abstract:

This master's thesis is a Ph.D. research plan for motif discovery in biological sequences, and consists of three main parts. Chapter 2 is a survey of methods for motif discovery in DNA regulatory regions, with a special emphasis on computational models. The survey presents an integrated model of the problem that allows systematic and coherent treatment of the surveyed methods. Chapter 3 presents a new algorithm for composite motif discovery in biological sequences. This algorithm has been used with success for motif discovery in protein sequences, and will in future work be extended to explore properties of the DNA regulatory mechanism. Finally, chapter 4 describes several current research projects, as well as some more general future directions of research. The research focuses on the development of new algorithms for the discovery of composite motifs in DNA. These algorithms will partly be used for systematic exploration of the DNA regulatory mechanism. An increased understanding of this mechanism may lead to more accurate computational models, and hence more sensitive motif discovery methods.

18

Vázquez, García Ignacio. "Molecular evolution of biological sequences." Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/284174.

Abstract:
Evolution is a ubiquitous feature of living systems. The genetic composition of a population changes in response to the primary evolutionary forces: mutation, selection and genetic drift. Organisms undergoing rapid adaptation acquire multiple mutations that are physically linked in the genome, so their fates are mutually dependent and selection only acts on these loci in their entirety. This aspect has been largely overlooked in the study of asexual or somatic evolution and plays a major role in the evolution of bacterial and viral infections and cancer. In this thesis, we put forward a theoretical description for a minimal model of evolutionary dynamics to identify driver mutations, which carry a large positive fitness effect, among passenger mutations that hitchhike on successful genomes. We examine the effect this mode of selection has on genomic patterns of variation to infer the location of driver mutations and estimate their selection coefficient from time series of mutation frequencies. We then present a probabilistic model to reconstruct genotypically distinct lineages in mixed cell populations from DNA sequencing. This method uses Hidden Markov Models for the deconvolution of genetically diverse populations and can be applied to clonal admixtures of genomes in any asexual population, from evolving pathogens to the somatic evolution of cancer. To understand the effects of selection on rapidly adapting populations, we constructed sequence ensembles in a recombinant library of budding yeast (S. cerevisiae). Using DNA sequencing, we characterised the directed evolution of these populations under selective inhibition of rate-limiting steps of the cell cycle. We observed recurrent patterns of adaptive mutations and characterised common mutational processes, but the spectrum of mutations at the molecular level remained stochastic.
Finally, we investigated the effect of genetic variation on the fate of new mutations, which gives rise to complex evolutionary dynamics. We demonstrate that the fitness variance of the population can set a selective threshold on new mutations, setting a limit to the efficiency of selection. In summary, we combined statistical analyses of genomic sequences, mathematical models of evolutionary dynamics and experiments in molecular evolution to advance our understanding of rapid adaptation. Our results open new avenues in our understanding of population dynamics that can be translated to a range of biological systems.
19

Isa, Mohammad Nazrin. "High performance reconfigurable architectures for biological sequence alignment." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/7721.

Abstract:
Bioinformatics and computational biology (BCB) is a rapidly developing multidisciplinary field which encompasses a wide range of domains, including genomic sequence alignment. Sequence alignment is a fundamental tool in molecular biology for detecting homology between sequences. Sequence alignments are currently gaining close attention due to their great impact on quality of life, for example by facilitating early disease diagnosis, identifying the characteristics of a newly discovered sequence, and supporting drug engineering. With the vast growth of genomic data, searching for sequence homology over huge databases (often measured in gigabytes) cannot produce results within a realistic time, hence the need for acceleration. Since the exponential increase of biological databases as a result of the Human Genome Project (HGP), supercomputers and other parallel architectures such as special-purpose Very Large Scale Integration (VLSI) chips, Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) have become popular acceleration platforms. Nevertheless, there is always a trade-off between area, speed, power, cost, development time and reusability when selecting an acceleration platform. FPGAs generally offer more flexibility, higher performance and lower overheads. However, they suffer from a relatively low-level programming model compared with off-the-shelf processors such as standard microprocessors and GPUs. Because of these limitations, the need has arisen for optimized FPGA core implementations, which are crucial for this technology to become viable in high performance computing (HPC). This research proposes the use of state-of-the-art reprogrammable system-on-chip technology on FPGAs to accelerate three widely used sequence alignment algorithms: the Smith-Waterman with affine gap penalty algorithm, the profile hidden Markov model (HMM) algorithm and the Basic Local Alignment Search Tool (BLAST) algorithm.
This research has three novel aspects. Firstly, the algorithms are designed and implemented in hardware, with each core achieving the highest performance compared to the state-of-the-art. Secondly, an efficient scheduling strategy based on the double buffering technique is adopted into the hardware architectures. Here, when the alignment matrix computation task is overlapped with the PE configuration in a folded systolic array, the overall throughput of the core is significantly increased. This is due to the bounded PE configuration time and the parallel PE configuration approach, irrespective of the number of PEs in a systolic array. In addition, the use of only two configuration elements in the PE optimizes hardware resources and enables the scalability of PE systolic arrays without relying on restricted onboard memory resources. Finally, a new performance metric is devised, which facilitates the effective comparison of design performance between different FPGA devices and families. The normalized performance indicator (speed-up per area per process technology) factors out the advantages of the area and lithography technology of any FPGA, resulting in fairer comparisons. The cores have been designed using Verilog HDL and prototyped on the Alpha Data ADM-XRC-5LX card with the Virtex-5 XC5VLX110-3FF1153 FPGA. The implementation results show that the proposed architectures achieved giga cell updates per second (GCUPS) performances of 26.8, 29.5 and 24.2 respectively for the acceleration of the Smith-Waterman with affine gap penalty algorithm, the profile HMM algorithm and the BLAST algorithm. In terms of speed-up improvements, the performance of the designed cores was compared against their corresponding software and against the reported FPGA implementations. In the case of comparison with equivalent software execution, acceleration of the optimal alignment algorithm in hardware yielded an average speed-up of 269x as compared to the SSEARCH 35 software.
For the profile HMM-based sequence alignment, the designed core achieved speed-ups of 103x and 8.3x against HMMER 2.0 and the latest version of HMMER (version 3.0) respectively. On the other hand, the implementation of the gapped BLAST with the two-hit method in hardware achieved a greater than tenfold speed-up compared to the latest NCBI BLAST software. In terms of comparison against other reported FPGA implementations, the proposed normalized performance indicator was used to evaluate the designed architectures fairly. The results showed that the first architecture achieved more than 50 percent improvement, while acceleration of the profile HMM sequence alignment in hardware gained a normalized speed-up of 1.34. In the case of the gapped BLAST with the two-hit method, the designed core achieved an 11x speed-up after factoring out the advantages of the Virtex-5 FPGA. In addition, further analysis was conducted in terms of cost and power performance; it was noted that the core achieved 0.46 MCUPS per dollar spent and 958.1 MCUPS per watt. This shows that FPGAs can be an attractive platform for high performance computation, with the advantages of a smaller area footprint and of being an economical 'green' solution compared to the other acceleration platforms. Higher throughput can be achieved by redeploying the cores on newer, bigger and faster FPGAs with minimal design effort.
20

Tangirala, Karthik. "Unsupervised feature construction approaches for biological sequence classification." Diss., Kansas State University, 2015. http://hdl.handle.net/2097/19123.

Abstract:
Doctor of Philosophy
Department of Computing and Information Sciences
Doina Caragea
Recent advancements in biological sciences have resulted in the availability of large amounts of sequence data (DNA and protein sequences). Biological sequence data can be annotated using machine learning techniques, but most learning algorithms require data to be represented by a vector of features. In the absence of biologically informative features, k-mers generated using a sliding window-based approach are commonly used to represent biological sequences. A larger k value typically results in better features; however, the number of k-mer features is exponential in k, and many k-mers are not informative. Feature selection is widely used to reduce the dimensionality of the input feature space. Most feature selection techniques use feature-class dependency scores to rank the features. However, when the amount of available labeled data is small, feature selection techniques may not accurately capture feature-class dependency scores. Therefore, instead of working with all k-mers, this dissertation proposes the construction of a reduced set of informative k-mers that can be used to represent biological sequences. This work resulted in three novel unsupervised approaches to construct features:
1. Burrows Wheeler Transform-based approach, which uses the sorted permutations of a given sequence to construct sequential features (subsequences) that occur multiple times in a given sequence.
2. Community detection-based approach, which uses a community detection algorithm to group similar subsequences into communities and refines the communities to form motifs (groups of similar subsequences). Motifs obtained using the community detection-based approach satisfy the ZOMOPS constraint (Zero, One or Multiple Occurrences of a Motif Per Sequence). All possible unique subsequences of the obtained motifs are then used as features to represent the sequences.
3. Hybrid-based approach, which combines the Burrows Wheeler Transform-based approach and the community detection-based approach to allow certain mismatches to the features constructed using the Burrows Wheeler Transform-based approach.
To evaluate the predictive power of the features constructed using the proposed approaches, experiments were conducted in three learning scenarios: supervised, semi-supervised, and domain adaptation for both nucleotide and protein sequence classification problems. The performance of classifiers learned using features generated with the proposed approaches was compared with the performance of the classifiers learned using k-mers (with feature selection) and feature hashing (another unsupervised dimensionality reduction technique). Experimental results from the three learning scenarios showed that features constructed with the proposed approaches were typically more informative than k-mers and feature hashing.
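As an illustration of the baseline k-mer representation described in this abstract (not of the dissertation's own feature construction methods), the following minimal Python sketch counts k-mers with a sliding window of step 1 and lays the counts out as a fixed-length vector. The function names `kmer_counts` and `kmer_vector` and the default DNA alphabet are assumptions made for this example only.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring seen by a sliding window of step 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 windows (none if n < k).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(seq: str, k: int, alphabet: str = "ACGT") -> list[int]:
    """Return counts for all |alphabet|**k possible k-mers in a fixed order.

    The vector length grows exponentially in k, which is the
    dimensionality problem the abstract refers to.
    """
    counts = kmer_counts(seq, k)
    vocabulary = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts.get(kmer, 0) for kmer in vocabulary]


if __name__ == "__main__":
    example = "ACGTACGA"  # hypothetical toy sequence
    print(kmer_counts(example, 3).most_common(1))
    print(len(kmer_vector(example, 2)))
```

With a four-letter alphabet the vector has 4**k entries, so even moderate values of k produce mostly empty feature vectors for short sequences.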
21

Pappas, Nicholas Peter. "Searching Biological Sequence Databases Using Distributed Adaptive Computing." Thesis, Virginia Tech, 2003. http://hdl.handle.net/10919/31074.

Abstract:
Genetic research projects can currently require enormous computing power to process the vast quantities of data available. Further, DNA sequencing projects are generating data at an exponential rate greater than that of the development of microprocessor technology; thus, new, faster methods and techniques for processing this data are needed. One common type of processing involves searching a sequence database for the most similar sequences. Here we present a distributed database search system that utilizes adaptive computing technologies. The search is performed using the Smith-Waterman algorithm, a common sequence comparison algorithm. To reduce the total search time, an initial search is performed using a version of the algorithm, implemented in adaptive computing hardware, which is designed to efficiently perform the initial search. A final search is performed using a complete version of the algorithm. This two-stage search, employing adaptive and distributed hardware, achieves a performance increase of several orders of magnitude over similar processor-based systems.
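For readers unfamiliar with the Smith-Waterman algorithm named in this abstract, the sketch below computes the best local alignment score of two sequences in software. It is a minimal illustration, not the thesis's hardware design: it assumes a linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name `smith_waterman_score` are chosen for this example only. It returns only the score and keeps two rows of the matrix in memory.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical toy sequences sharing the substring "ACGT".
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Recovering the alignment itself would additionally require storing the full matrix, or traceback pointers, and walking back from the best-scoring cell.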
Master of Science
22

Kim, Eagu. "Inverse Parametric Alignment for Accurate Biological Sequence Comparison." Diss., The University of Arizona, 2008. http://hdl.handle.net/10150/193664.

Abstract:
For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. In practice, substitution scores are usually chosen by convention, and gap penalties are often found by trial and error. In contrast, a rigorous way to determine parameter values that are appropriate for aligning biological sequences is by solving the problem of Inverse Parametric Sequence Alignment. Given examples of biologically correct reference alignments, this is the problem of finding parameter values that make the examples score as close as possible to optimal alignments of their sequences. The reference alignments that are currently available contain regions where the alignment is not specified, which leads to a version of the problem with partial examples. In this dissertation, we develop a new polynomial-time algorithm for Inverse Parametric Sequence Alignment that is simple to implement, fast in practice, and can learn hundreds of parameters simultaneously from hundreds of examples. Computational results with partial examples show that the best possible values for all 212 parameters of the standard alignment scoring model for protein sequences can be computed from 200 examples in 4 hours of computation on a standard desktop machine. We also consider a new scoring model with a small number of additional parameters that incorporates predicted secondary structure for the protein sequences. By learning parameter values for this new secondary-structure-based model, we can improve on the alignment accuracy of the standard model by as much as 15% for sequences with less than 25% identity.
23

Verzotto, Davide. "Advanced Computational Methods for Massive Biological Sequence Analysis." Doctoral thesis, Università degli studi di Padova, 2011. http://hdl.handle.net/11577/3426282.

Abstract:
With the advent of modern sequencing technologies, massive amounts of biological data, from protein sequences to entire genomes, are becoming increasingly available. This creates a need for the automatic analysis and classification of such a huge collection of data, in order to enhance knowledge in the Life Sciences. Although many research efforts have been made to mathematically model this information, for example by finding patterns and similarities among protein or genome sequences, these approaches often lack structures that address specific biological issues. In this thesis, we present novel computational methods for three fundamental problems in molecular biology: the detection of remote evolutionary relationships among protein sequences, the identification of subtle biological signals in related genome or protein functional sites, and phylogeny reconstruction by means of whole-genome comparisons. The main contribution is a systematic analysis of patterns that may affect these tasks, leading to the design of practical and efficient new pattern discovery tools. We thus introduce two advanced paradigms of pattern discovery and filtering based on the insight that functional and conserved biological motifs, or patterns, should lie in different sites of sequences. This enables space-conscious approaches that avoid counting the same patterns multiple times. The first paradigm considered, namely irredundant common motifs, concerns the discovery of common patterns, for two sequences, that have occurrences not covered by other patterns, whose coverage is defined by means of specificity and extension. The second paradigm, namely underlying motifs, concerns the filtering of patterns, from a given set, that have occurrences not overlapping other patterns with higher priority, where priority is defined by lexicographic properties of patterns on the boundary between pattern matching and statistical analysis.
We develop three practical methods directly based on these advanced paradigms. Experimental results indicate that we are able to identify subtle similarities among biological sequences, using the same type of information only once. In particular, we employ the irredundant common motifs and the statistics based on these patterns to solve the remote protein homology detection problem. Results show that our approach, called Irredundant Class, outperforms the state-of-the-art methods in a challenging benchmark for protein analysis. Afterwards, we establish how to compare and filter a large number of complex motifs (e.g., degenerate motifs) obtained from modern motif discovery tools, in order to identify subtle signals in different biological contexts. In this case we employ the notion of underlying motifs. Tests on large protein families indicate that we drastically reduce the number of motifs that scientists should manually inspect, further highlighting the actual functional motifs. Finally, we combine the two proposed paradigms to allow the comparison of whole genomes, and thus the construction of a novel and practical distance function. With our method, called Unic Subword Approach, we relate to each other the regions of two genome sequences by selecting conserved motifs during evolution. Experimental results show that our approach achieves better performance than other state-of-the-art methods in the whole-genome phylogeny reconstruction of viruses, prokaryotes, and unicellular eukaryotes, further identifying the major clades of these organisms.
With the advent of modern sequencing technologies, massive amounts of biological data, from protein sequences to entire genomes, are available for research. This progress calls for the automatic analysis and classification of such data collections in order to improve knowledge in the Life Sciences. Although many approaches have been proposed so far to model biological sequences mathematically, for example by searching for patterns and similarities among genomic or protein sequences, these methods often lack structures capable of addressing specific biological questions. In this thesis, we present new computational methods for three fundamental problems in molecular biology: the discovery of remote evolutionary relationships among protein sequences, the identification of complex biological signals in related functional sites, and the reconstruction of the phylogeny of a set of organisms through the comparison of whole genomes. The main contribution is the systematic analysis of the patterns that may bear on these problems, leading to the design of new effective and efficient computational tools. We thus introduce two advanced paradigms for pattern discovery and filtering, based on the observation that functional biological motifs, or patterns, are located in different regions of the sequences under examination. This observation makes it possible to build parsimonious approaches that avoid counting the same patterns multiple times. The first paradigm, irredundant common motifs, concerns the discovery of patterns common to pairs of sequences that have occurrences not covered by other patterns, where coverage is defined by greater specificity and/or possible extension of the patterns.
The second paradigm, underlying motifs, concerns the filtering of patterns that have occurrences not overlapping those of other patterns with higher priority, where priority is defined by lexicographic properties of the patterns on the boundary between pattern matching and statistical analysis. Three computational methods based on these advanced paradigms were developed. Experimental results indicate that our methods can identify the main similarities among biological sequences while using the available information in a non-redundant way. In particular, using irredundant common motifs and the statistics based on these patterns, we solve the problem of remote protein homology detection. The results show that our approach, called Irredundant Class, performs very well on a challenging benchmark and improves on state-of-the-art methods. In addition, to identify complex biological signals we use the notion of underlying motifs, thereby defining ways to compare and filter degenerate motifs obtained with modern pattern discovery tools. Experiments on large protein families show that our method drastically reduces the number of motifs that scientists would otherwise have to inspect manually, while also highlighting the functional motifs identified in the literature. Finally, by combining the two proposed paradigms, we present a new and practical distance function between whole genomes. With our method, called Unic Subword Approach, we relate the different regions of two genomic sequences to each other by selecting the motifs conserved during evolution.
Experimental results show that our approach performs better than other state-of-the-art methods in reconstructing the phylogeny of organisms such as viruses, prokaryotes, and unicellular eukaryotes, while also identifying the main subclasses of these species.
24

Mohanty, Pragyan Paramita. "Function-based Algorithms for Biological Sequences." OpenSIUC, 2015. https://opensiuc.lib.siu.edu/dissertations/1120.

Abstract:
An abstract of the dissertation of Pragyan P. Mohanty, for the Doctor of Philosophy degree in Electrical and Computer Engineering, presented on June 11, 2015, at Southern Illinois University Carbondale.
Title: Function-based Algorithms for Biological Sequences
Major Professor: Dr. Spyros Tragoudas
Two problems at two different abstraction levels of computational biology are studied. At the molecular level, efficient pattern matching algorithms in DNA sequences are presented. For gene order data, an efficient data structure is presented that is capable of storing all gene re-orderings in a systematic manner. A common characteristic of the presented methods is the use of binary decision diagrams that store and manipulate binary functions. Searching for a particular pattern in a very large DNA database is a fundamental and essential component of computational biology. In the biological world, pattern matching is required for tasks such as finding repeats in a particular DNA sequence, finding motifs, and aligning sequences. Because of the immense amount and continuous growth of biological data, the searching process requires very fast algorithms. It also requires encoding schemes that store the data efficiently for these search processes to operate on. With continuous progress in genome sequencing, genome rearrangement analysis and the construction of evolutionary genome graphs, which represent the relationships between genomes, have become challenging tasks. Previous approaches are largely based on distance measures, so that relationships between more phylogenetic species can be established with specifically required rearrangement operations and hence within a certain computational time. However, because of the large volume of available data, the storage space and construction time for this evolutionary graph are still a problem. In addition, it is important to keep track of all possible rearrangement operations for a particular genome, as biological processes are uncertain.
This study presents a binary function-based tool set for efficient DNA sequence storage. A novel scalable method is also developed for fast offline pattern searches in large DNA sequences. This study also presents a method that efficiently stores all the gene sequences associated with all possible genome rearrangements, such as transpositions, and constructs the evolutionary genome structure much faster for multiple species. The developed methods benefit from the use of Boolean functions, their compact storage in canonical data structures, and the existence of built-in operators for these data structures. The time complexities depend on the size of the data structures used for storing the functions that represent the DNA sequences and/or gene sequences. It is shown that the presented approaches exhibit time complexity that is sublinear in the sequence size. The number of nodes present in the DNA data structure, the string search time on these data structures, the depths of the genome graph structure, and the time of the rearrangement operations are reported. Experiments on DNA sequences from the NCBI database are conducted for the DNA sequence storage and search process. Experiments on large gene order data sets, such as human mitochondrial data and plant chloroplast data, are conducted, and the depth of this structure is studied for evolutionary processes on gene sequences. The results show that the developed approaches are scalable.
25

Margolin, Yelena 1977. "Analysis of sequence-selective guanine oxidation by biological agents." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/42381.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Biological Engineering Division, February 2008.
Oxidatively damaged DNA has been strongly associated with cancer, chronic degenerative diseases and aging. Guanine is the most frequently oxidized base in the DNA, and generation of a guanine radical cation (G'") as an intermediate in the oxidation reaction leads to migration of a resulting cationic hole through the DNA n-stack until it is trapped at the lowest-energy sites. These sites reside at runs of guanines, such as 5'-GG-3' sequences, and are characterized by the lowest sequence-specific ionization potentials (IPs). The charge transfer mechanism suggests that hotspots of oxidative DNA damage induced by electron transfer reagents can be predicted based on the primary DNA sequence. However, preliminary data indicated that nitrosoperoxycarbonate (ONOOCO2"), a mediator of chronic inflammation and a one-electron oxidant, displayed unusual guanine oxidation properties that were the focus of present work. As a first step in our study, we determined relative levels of guanine oxidation, induced by ONOOCO2 in all possible three-base sequence contexts (XGY) within double-stranded oligonucleotides. These levels were compared to the relative oxidation induced within the same guanines by photoactivated riboflavin, a one-electron reagent. We found that, in agreement with previous studies, photoactivated riboflavin was selective for guanines of lowest IPs located within 5'-GG-3' sequences. In contrast, ONOOCO2" preferentially reacted with guanines located within 5'-GC-3' sequences characterized by the highest IPs. This demonstrated that that sequence-specific IP was not a determinant of guanine reactivity with ONOOCO2". Sequence selectivities for both reagents were double-strand specific. Selectivity of ONOOCO2 for 5'-GC-3' sites was also observed in human genomic DNA after ligation-mediated PCR analysis.
(cont.) Relative yields of different guanine lesions produced by both ONOOCO2" and riboflavin varied 4- to 5-fold across all sequence contexts. To assess the role of solvent exposure in mediating guanine oxidation by ONOOCO2", relative reactivities of mismatched guanines with ONOOCO2" were measured. The majority of the mismatches displayed an increased reactivity with ONOOCO2 as compared to the fully matched G-C base-pairs. The extent of reactivity enhancement was sequence context-dependent, and the greatest levels of enhancement were observed for the conformationally flexible guanine- guanine (G-G) mismatches and for guanines located across from a synthetic abasic site. To test the hypothesis that the negative charge of an oxidant influences its reactivity with guanines in DNA, sequence-selective guanine oxidation by a negatively charged reagent, Fe+2-EDTA, was assessed and compared to guanine oxidation produced by a neutral oxidant, y-radiation. Because both of these agents cause high levels of deoxyribose oxidation, a general method to quantify sequence-specific nucleobase oxidation in the presence of direct strand breaks was developed. This method exploited activity of exonuclease III (Exo III), a 3' to 5' exonuclease, and utilized phosphorothioate-modified synthetic oligonucleotides that were resistant to Exo III activity. This method was employed to determine sequence-selective guanine oxidation by Fe+2-EDTA complex and y-radiation and to show that both agents produced identical guanine oxidation pattems and were equally reactive with all guanines, irrespective of their sequence-specific IPs or sequence context.
This showed that negative charge was not a determinant of Fe+2-EDTA-mediated guanine oxidation. Finally, the role of oxidant binding on nucleobase damage was assessed by studying sequence-selective oxidation produced by DNA-bound Fe+2 ions in the presence of H2O2. We found that the major oxidation targets were thymines located within 5'-TGG-3' motifs, demonstrating that while guanines were a required element for coordination of Fe+2 to DNA, they were not oxidized. Our results suggest that factors other than sequence-specific IPs can act as major determinants of sequence-selective guanine oxidation, and that current models of guanine oxidation and charge transfer in DNA cannot be used to adequately predict the location and identity of mutagenic lesions in the genome.
by Yelena Margolin.
Ph.D.
26

Tångrot, Jeanette. "Structural Information and Hidden Markov Models for Biological Sequence Analysis." Doctoral thesis, Umeå universitet, Institutionen för datavetenskap, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-1629.

Abstract:
Bioinformatics is a fast-developing field, which makes use of computational methods to analyse and structure biological data. An important branch of bioinformatics is structure and function prediction of proteins, which is often based on finding relationships to already characterized proteins. It is known that two proteins with very similar sequences also share the same 3D structure. However, there are many proteins with similar structures that have no clear sequence similarity, which makes it difficult to find these relationships. In this thesis, two methods for annotating protein domains are presented, one aiming at assigning the correct domain family or families to a protein sequence, and the other aiming at fold recognition. Both methods use hidden Markov models (HMMs) to find related proteins, and they both exploit the fact that structure is more conserved than sequence, but in two different ways. Most of the research presented in the thesis focuses on the structure-anchored HMMs, saHMMs. For each domain family, an saHMM is constructed from a multiple structure alignment of carefully selected representative domains, the saHMM-members. These saHMM-members are collected in the so-called "midnight ASTRAL set", and are chosen so that all saHMM-members within the same family have mutual sequence identities below a threshold of about 20%. In order to construct the midnight ASTRAL set and the saHMMs, a pipeline of software tools was developed. The saHMMs are shown to be able to detect the correct family relationships at very high accuracy, and perform better than the standard tool Pfam in assigning the correct domain families to new domain sequences. We also introduce the FI-score, which is used to measure the performance of the saHMMs, in order to select the optimal model for each domain family. The saHMMs are made available for searching through the FISH server, and can be used for assigning family relationships to protein sequences.
The other approach presented in the thesis is secondary structure HMMs (ssHMMs). These HMMs are designed to use both the sequence and the predicted secondary structure of a query protein when scoring it against the model. A rigorous benchmark is used, which shows that HMMs made from multiple sequences result in better fold recognition than those based on single sequences. Adding secondary structure information to the HMMs improves the ability of fold recognition further, both when using true and predicted secondary structures for the query sequence.
Bioinformatics is a field in which computational and statistical methods are used to analyse and structure biological data. An important area within bioinformatics seeks to predict the three-dimensional structure and function of a protein from its amino acid sequence and/or its similarities to other, already characterized, proteins. It is known that two proteins with similar amino acid sequences also have similar three-dimensional structures. However, the fact that two proteins have similar structures does not necessarily mean that their sequences are similar, which can make it difficult to find structural similarities from a protein's amino acid sequence. This thesis describes two methods for finding similarities between proteins: one focuses on determining which family of protein domains, with known 3D structure, a given sequence belongs to, while the other attempts to predict a protein's fold, i.e. to give a rough picture of the protein's structure. Both methods use hidden Markov models (HMMs), a statistical method that can be used, among other things, to describe protein families. With the help of an HMM, one can predict whether a given protein sequence belongs to the family that the model represents. Both methods also use structural information to increase the models' ability to recognize related sequences, but in different ways. Most of the work in the thesis concerns structure-anchored HMMs (saHMMs). The saHMMs are built from structure-based sequence alignments, which are generated from how the protein domains can be superimposed in space rather than from which amino acids make up their sequences. In each protein family, only a special, representative selection of domains is used. These are chosen so that, when the sequences are compared pairwise, no pair within the family has a sequence identity higher than about 20%. This selection is made to obtain as wide a spread as possible among the sequences within the family.
A software suite has been developed to select representatives for each family and then build saHMMs based on them. It turns out that the saHMMs can find the right family for a high proportion of the tested sequences, with almost no errors. They are also better than the widely used method Pfam at finding the right family for entirely new protein sequences. The saHMMs are available through the FISH server, which anyone can use over the Internet to find which family a protein of interest may belong to. The other method presented in the thesis is secondary structure HMMs (ssHMMs), which are built from ordinary multiple sequence alignments but also from information about the secondary structures of the protein sequences in the family. When a protein sequence is compared with the ssHMM, a prediction of its secondary structure is used, and the calculated probability that the sequence belongs to the family is based on both the amino acid sequence and the secondary structure. A comparison shows that HMMs based on multiple sequences are better than those based on only a single sequence at finding the right fold for a protein sequence. The HMMs become even better if the secondary structure is also taken into account, both when the true secondary structure is used and when a theoretically predicted one is used.
Jeanette Hargbo.
27

Budach, Stefan [Verfasser]. "Explainable deep learning models for biological sequence classification / Stefan Budach." Berlin : Freie Universität Berlin, 2021. http://d-nb.info/1230407413/34.

28

Buckingham, Lawrence. "K-mer based algorithms for biological sequence comparison and search." Thesis, Queensland University of Technology, 2022. https://eprints.qut.edu.au/236377/1/Buckingham%2BThesis%281%29.pdf.

Abstract:
The present thesis develops novel algorithms for biological sequence comparison and accelerated sequence database search, motivated by the need to work effectively with the rapidly expanding volume of sequence data which is available as the result of continuing advances in sequencing technology. Empirical tests using datasets of realistic size and content demonstrate that these algorithms are approximately an order of magnitude faster than standard sequence database search tools, while attaining higher precision. While the algorithms have been developed and tested in a biological context, they are applicable to any problem involving comparison of sequential data series.
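The abstract does not describe the algorithms themselves, so the following is only a minimal sketch of the general idea behind k-mer based comparison, not the method developed in the thesis. It assumes a Jaccard index over k-mer sets as the similarity measure; the function names `kmers`, `jaccard_similarity` and `rank_database` are hypothetical.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two k-mer sets (0.0 = disjoint, 1.0 = identical)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        # Both sequences are shorter than k: no evidence of similarity.
        return 0.0
    return len(ka & kb) / len(ka | kb)


def rank_database(query: str, database: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    """Score every database sequence against the query, best match first."""
    scores = [(name, jaccard_similarity(query, seq, k)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Because the comparison only looks at shared substrings, the same sketch applies to any sequential data, which matches the abstract's remark about non-biological use.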
29

Valebjørg, Vetle Søraas. "Discovery of approximate composite motifs in biological sequences." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2006. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-10130.

Abstract:

Mapping the regulatory system in living organisms is a great challenge, and many methods have been created during the last 15 years to solve this problem. The biological processes are, however, more flexible and complex than first thought, and many of the methods lack the ability to imitate this exactly. The new method devised here is not a complete solution to this situation, but poses an innovative solution for finding approximate composite patterns in a set of sequences. Motifs are read from any third-party tool, represented as either {A,C,G,T}, IUPAC or PWMs, and weighted with significance and support as an estimate of how important the patterns are. Finding combinations with both high significance and support can reveal important properties preserved in the sequences. Based on this, the algorithm uses a branch-and-bound approach to traverse every combination while preserving the best solutions to this multi-objective optimization problem in a Pareto front. The best patterns found are investigated further by applying different statistical and experimental methods to better support their significance. The three most important tests done on the TransCompel dataset were: (i) measuring the predicted patterns against known sites based on nucleotide correlation; (ii) finding the frequency of motifs participating in the combinations, so that the best could be studied manually; and (iii) comparing different tests when the significance was based on real background sequences instead of the uniform distribution. Some of the results were low, but still similar to the accuracy provided by other known methods that have been tested in the same way. The test results can be biased by the parameters used, by a too simple and restrictive test set, or by faulty predictions on the dataset tested. More testing and tuning of parameters might result in better predictions.
However, the different tests still proved this method to be a valuable tool in composite motif discovery.
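The Pareto front mentioned in the abstract can be illustrated with a short sketch. It covers only the bookkeeping of non-dominated solutions over the two objectives (significance and support), not the branch-and-bound traversal; the `Candidate` type, its fields and the example scores are assumptions made for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    motifs: tuple[str, ...]  # the motif combination (hypothetical representation)
    significance: float      # higher is better
    support: float           # higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    return (a.significance >= b.significance and a.support >= b.support
            and (a.significance > b.significance or a.support > b.support))


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    front: list[Candidate] = []
    for cand in candidates:
        if any(dominates(kept, cand) for kept in front):
            continue  # cand is dominated by a solution already kept
        # cand enters the front and evicts everything it dominates.
        front = [kept for kept in front if not dominates(cand, kept)]
        front.append(cand)
    return front
```

A combination with high significance but low support and one with the opposite profile both stay in the front, since neither dominates the other.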

30

Pethica, Ralph Brian. "Sequences, structures and biological functions of molecular evolution." Thesis, University of Bristol, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.546211.

31

Mann, Anita. "Structures and biological effects of repeated DNA sequences." Thesis, University of Kent, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.263749.

32

MENES, ALEJANDRO MUSTELIER. "QUALITY EVALUATION FOR FRAGMENTS ASSEMBLY OF BIOLOGICAL SEQUENCES." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2017. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=33967@1.

Abstract:
PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
PROGRAMA DE EXCELENCIA ACADEMICA
Nos últimos anos surgiram novas tecnologias de sequenciamento de DNA conhecidas como NGS - Next-Generation Sequencing. Estas são responsáveis por tornar o processo de sequenciamento mais rápido e menos custoso, mas também trazem como resultado fragmentos de DNA muito pequenos, conhecidos como reads. A montagem do genoma a partir destes fragmentos é considerada um problema complexo devido à sua natureza combinatória e ao grande volume de reads produzidos. De maneira geral, os biólogos e bioinformatas escolhem o programa montador de sequências sem levar em consideração informações da eficiência computacional ou da qualidade biológica do resultado. Esta pesquisa tem como objetivo auxiliar aos usuários biólogos a avaliar a qualidade dos resultados da montagem. Primeiramente, foi projetada e desenvolvida uma metodologia para obter informações dos genes presentes na montagem, listando os genes que podem ser identificados, aqueles que têm o tamanho correto e a sequência de pares de bases correta. Em segundo lugar, foram realizados testes experimentais exaustivos envolvendo cinco dos principais montadores de genoma conhecidos na literatura os quais são baseados no uso de grafos de Bruijn e oito genomas de bactérias. Foram feitas comparações estatísticas do resultado usando as ferramentas QUAST e REAPR. Também foram obtidas informações qualitativas dos genes usando o algoritmo proposto e algumas métricas de eficiência. Em função dos resultados coletados, é feita uma análise comparativa que permite aos usuários conhecer melhor o comportamento das ferramentas consideradas nos testes. Por fim, foi desenvolvida uma ferramenta que recebe diferentes resultados de montagens de um mesmo genoma e produz um relatório qualitativo e quantitativo para o usuário interpretar os resultados de maneira integrada.
New DNA sequencing technologies, known as NGS (Next-Generation Sequencing), are responsible for making the sequencing process more efficient. However, they produce very small DNA fragments, known as reads. Genome assembly from these fragments is considered a complex problem due to its combinatorial nature and the large volume of reads produced. In general, biologists and bioinformatics experts choose the sequence assembler program with no regard to its computational efficiency or even to the biological quality of the result. This research aims to assist users in the interpretation of assembly results, including effectiveness and efficiency. In addition, this may sometimes increase the quality of the results obtained. Firstly, we propose an algorithm to obtain information about the genes present in the resulting assembly. We enumerate the identified genes and those that have the correct size and the correct base pair sequence. Next, exhaustive experimental tests were run involving five of the main genome assemblers in the literature, all based on de Bruijn graphs, and eight bacterial genome data sets. We have performed statistical comparisons of the results using the QUAST and REAPR tools. We have also obtained qualitative information for the genes using the proposed algorithm and some computational efficiency metrics. Based on the collected results, we present a comparative analysis that allows users to better understand the behavior of the tools considered in the tests. Finally, we propose a tool that receives different assemblies of the same genome and produces a qualitative and quantitative report for the user, enabling the interpretation of the results in an integrated way.
33

Hu, Xiong. "Examining biological function and recombination using nucleotide sequences /." The Ohio State University, 1998. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487950153601498.

34

Robertson, Jeffrey Alan. "Entropy Measurements and Ball Cover Construction for Biological Sequences." Thesis, Virginia Tech, 2018. http://hdl.handle.net/10919/84470.

Abstract:
As improving technology is making it easier to select or engineer DNA sequences that produce dangerous proteins, it is important to be able to predict whether a novel DNA sequence is potentially dangerous by determining its taxonomic identity and functional characteristics. These tasks can be facilitated by the ever increasing amounts of available biological data. Unfortunately, though, these growing databases can be difficult to take full advantage of due to the corresponding increase in computational and storage costs. Entropy scaling algorithms and data structures present an approach that can expedite this type of analysis by scaling with the amount of entropy contained in the database instead of scaling with the size of the database. Because sets of DNA and protein sequences are biologically meaningful rather than random, they exhibit some amount of structure. As biological databases grow, taking advantage of this structure can be extremely beneficial. The entropy scaling sequence similarity search algorithm introduced here demonstrates this by accelerating the biological sequence search tools BLAST and DIAMOND. Tests of the implementation of this algorithm show that while this approach can lead to improved query times, constructing the required entropy scaling indices is difficult and expensive. To improve performance and remove this bottleneck, I investigate several ideas for accelerating the building of indices that support entropy scaling searches. The results of these tests identify key tradeoffs and demonstrate that there is potential in using these techniques for sequence similarity searches.
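The idea of a ball cover index, named in the thesis title, can be sketched as follows. This is an illustration under strong assumptions rather than the thesis's implementation: it uses Hamming distance on equal-length sequences instead of BLAST or DIAMOND scores, a simple greedy construction, and hypothetical function names.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_ball_cover(sequences: list[str], radius: int) -> dict[str, list[str]]:
    """Greedy cover: a sequence joins the first centre within `radius`,
    otherwise it becomes the centre of a new ball."""
    cover: dict[str, list[str]] = {}
    for seq in sequences:
        for centre in cover:
            if hamming(seq, centre) <= radius:
                cover[centre].append(seq)
                break
        else:
            cover[seq] = [seq]
    return cover


def search(cover: dict[str, list[str]], query: str, max_dist: int, radius: int) -> list[str]:
    """Return all indexed sequences within `max_dist` of the query."""
    hits: list[str] = []
    for centre, members in cover.items():
        # Triangle inequality: a member within max_dist of the query forces
        # the centre to be within max_dist + radius, so farther balls are skipped.
        if hamming(query, centre) > max_dist + radius:
            continue
        hits.extend(m for m in members if hamming(query, m) <= max_dist)
    return hits
```

When the database is highly structured, many sequences fall into few balls and most balls are skipped at query time, which is the sense in which search cost follows the structure of the data rather than its raw size.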
Master of Science
35

Cinar, Ayse Basak. "Preadolescents and their mothers as oral health-promoting actors : non-biologic determinants of oral health among Turkish and Finnish preadolescents /." Helsinki : University of Helsinki, 2008. https://oa.doria.fi/bitstream/handle/10024/42564/preadole.pdf?sequence=1.

36

Mundhada, Hemanshu [Verfasser]. "Advancements of the Sequence Saturation Mutagenesis (SeSaM) Method to Efficiently Explore Protein Sequence Space / Hemanshu Mundhada." Bremen : IRC-Library, Information Resource Center der Jacobs University Bremen, 2012. http://d-nb.info/1035211459/34.

37

Won, Kyoung-Jae. "Exploring the structure of Hidden Markov Models for biological sequence analysis." Thesis, University of Southampton, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.427702.

38

Yang, Qingwu. "Finding conserved patterns in biological sequences, networks and genomes." College Station, Tex. : Texas A&M University, 2007. http://hdl.handle.net/1969.1/ETD-TAMU-2465.

39

Korol, Oksana. "ModuleInducer: Automating the Extraction of Knowledge from Biological Sequences." Thèse, Université d'Ottawa / University of Ottawa, 2011. http://hdl.handle.net/10393/20320.

Abstract:
In the past decade, rapid advances have been made in the sequencing, digitization and collection of biological data. However, the bottleneck remains at the point of analysis and extraction of patterns from the data. We have developed a method that is aimed at widening this bottleneck by automating knowledge extraction from biological data. Our approach is aimed at discovering patterns in a set of DNA sequences based on the location of transcription factor binding sites or any other biological markers, with an emphasis on discovering relationships. A variety of statistical and computational methods exist to analyze such data. However, they either require an initial hypothesis, which is later tested, or classify the data based on its attributes. Our approach does not require an initial hypothesis, and the classification it produces is based on the relationships between attributes. The value of such an approach is that it is able to uncover new knowledge about the data by inducing a general theory based on basic known rules. The core of our approach lies in an inductive logic programming (ILP) engine, which, based on positive and negative examples as well as background knowledge, is able to induce a descriptive, human-readable theory describing the data. An application provides an end-to-end analysis of DNA sequences. An easy-to-use Web interface accepts a set of related sequences to be analyzed, an optional set of negative example sequences to contrast with the main set, and a set of possible genetic markers as position-specific scoring matrices. A Java-based backend formats the sequences, determines the location of the genetic markers inside them and passes the information to the ILP engine, which induces the theory. The model assumed in our background knowledge is a set of basic interactions between biological markers in any DNA sequence.
This makes our approach applicable to a wide variety of biological problems, including detection of cis-regulatory modules and analysis of ChIP-Sequencing experiments. We have evaluated our method in the context of such applications on two real-world datasets as well as a number of specially designed synthetic datasets. The approach has been shown to have merit even in situations when no significant classification could be determined.
40

Gunewardena, Sumedha S. A. "Computational Tools for Identifying Functional Regions in Biological Sequences." Thesis, University of Oxford, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.491499.

Abstract:
Automated biological sequence annotation is a rapidly developing field. Computational tools to facilitate this exercise are urgently needed given the rate at which new genetic sequences are being accumulated. This dissertation introduces new approaches to two fundamental problems in contemporary bioinformatics, and their synergic integration. They are described in two parts. In Part I, we introduce a novel approach based on templates to differentiate between transcription factor binding sites and non-binding sites. Templates model three key discriminatory features, sequence homology, structural homology and nucleotide polymorphisms, which are present in various degrees in different transcription factor binding sites. We show how templates can be adopted to predict the actual binding affinity of a given binding site based on the distribution of binding affinities of a set of training sites. We also present examples demonstrating the excellent discriminative and predictive capabilities of templates for transcription factor binding sites. In Part II, we introduce a new framework for sequence alignment. Here, information is seen as information units that act upon the sequences being aligned rather than as an intrinsic part of the sequences themselves. The result is a versatile alignment tool that can dynamically incorporate knowledge on demand into a sequence alignment. We describe efficient data structures that form an integral part of such an alignment tool. The described data structures are efficient in terms of both storage and retrieval of information. We illustrate a hybrid alignment strategy geared towards accommodating the diversity of information available on the sequences being aligned. The alignment algorithms described are optimised over a combination of both the alignment of individual residue pairs and the alignment of sequence segments.
We present examples demonstrating the versatility of the described alignment framework and the high quality of alignments that it produces.
41

TRISTAO, CRISTIAN. "AN APPROACH TO MODEL, STORE AND ACCESS BIOLOGICAL SEQUENCES." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2012. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=21436@1.

Abstract:
PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO
FUNDAÇÃO DE APOIO À PESQUISA DO ESTADO DO RIO DE JANEIRO
As pesquisas na área da biologia molecular vêm produzindo um grande volume de dados e estes precisam ser bem organizados, estruturados e persistidos. Na sua grande maioria os dados biológicos são armazenados em arquivos no formato texto. Para grandes volumes de dados, o caminho natural seria utilizar SGBDs para gerenciá-los. Contudo, estes sistemas não possuem estruturas adequadas para representar e manipular dados específicos ao domínio. Por exemplo, sequências biológicas normalmente são tratadas como simples cadeias de caracteres (tipo texto/varchar) ou BLOB, e desta forma perde-se todo um conjunto de informações composicionais, posicionais e de conteúdo. Esta tese argumenta que a gerência de dados (estrutura, armazenamento e acesso de dados) se transformou em um dos principais problemas para o domínio de pesquisas da bioinformática. Desta maneira propõe-se um modelo conceitual biológico para representar informações do dogma central da biologia molecular, bem como um tipo abstrato de dado (ADT – do inglês Abstract Data Types) específico para a manipulação de sequências biológicas e seus derivados.
Research in molecular biology has been producing a large volume of data, which needs to be well organized, structured and persisted. Most biological data are stored in text-format files. For large volumes of data, the natural approach would be to use a DBMS to manage them. However, these systems do not have adequate structures to represent and manipulate data specific to the domain. For example, biological sequences are typically treated as simple strings (type text/varchar) or BLOBs, and thus a whole set of compositional, positional and content information is lost. This thesis argues that the management of data (structure, storage and data access) has become a major problem for research in bioinformatics. We therefore propose a conceptual model for representing biological information of the central dogma of molecular biology, as well as an abstract data type (ADT) specific to the manipulation of biological sequences and their derivatives.
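To make the contrast with plain strings concrete, the sketch below shows what a sequence-specific abstract data type might offer. It is a hypothetical illustration, not the ADT proposed in the thesis: the class name `DnaSequence` and its operations are assumptions chosen to show compositional (`gc_content`), positional (`find`) and content-level (`reverse_complement`, `transcribe`) operations that a varchar or BLOB column does not provide.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class DnaSequence:
    """A DNA sequence that validates its alphabet and knows domain operations."""
    residues: str

    def __post_init__(self) -> None:
        invalid = set(self.residues) - set("ACGT")
        if invalid:
            raise ValueError(f"invalid DNA symbols: {sorted(invalid)}")

    def reverse_complement(self) -> "DnaSequence":
        """Sequence of the opposite strand, read 5' to 3'."""
        return DnaSequence(self.residues.translate(_COMPLEMENT)[::-1])

    def transcribe(self) -> str:
        """mRNA matching this (coding) strand: T is replaced by U."""
        return self.residues.replace("T", "U")

    def gc_content(self) -> float:
        """Fraction of G and C residues (compositional information)."""
        if not self.residues:
            return 0.0
        return sum(base in "GC" for base in self.residues) / len(self.residues)

    def find(self, motif: str) -> list[int]:
        """0-based start positions of all, possibly overlapping, occurrences."""
        if not motif:
            raise ValueError("motif must not be empty")
        positions = []
        start = self.residues.find(motif)
        while start != -1:
            positions.append(start)
            start = self.residues.find(motif, start + 1)
        return positions
```

In a DBMS setting, operations like these would be exposed as functions of a user-defined type so that queries can filter on them directly.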
42

BERNARDINI, GIULIA. "COMBINATORIAL METHODS FOR BIOLOGICAL DATA." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2021. http://hdl.handle.net/10281/305220.

Abstract:
Lo scopo di questa tesi è di elaborare e analizzare metodi rigorosi dal punto di vista matematico per l’analisi di due tipi di dati biologici: dati relativi a pan-genomi e filogenesi. Con il termine “pan-genoma” si indica, in generale, un insieme di sequenze genomiche strettamente correlate (tipicamente appartenenti a individui della stessa specie) che si vogliano utilizzare congiuntamente come sequenze di riferimento per un’intera popolazione. Una filogenesi, invece, rappresenta le relazioni evolutive in un gruppo di entità, che siano esseri viventi, geni, lingue naturali, manoscritti antichi o cellule tumorali. Con l’eccezione di uno dei risultati presentati in questa tesi, relativo all’analisi di filogenesi tumorali, il taglio della dissertazione è prevalentemente teorico: lo scopo è studiare gli aspetti combinatori dei problemi affrontati, più che fornire soluzioni efficaci in pratica. Una conoscenza approfondita degli aspetti teorici di un problema, del resto, permette un'analisi matematicamente rigorosa delle soluzioni già esistenti, individuandone i punti deboli e quelli di forza, fornendo preziosi dettagli sul loro funzionamento e aiutando a decidere quali problemi vadano ulteriormente investigati. Oltretutto, è spesso il caso che nuovi risultati teorici (algoritmi, strutture dati o riduzioni ad altri problemi più noti) si possano direttamente applicare o adattare come soluzione ad un problema pratico, o come minimo servano ad ispirare lo sviluppo di nuovi metodi efficaci in pratica. La prima parte della tesi è dedicata a nuovi metodi per eseguire delle operazioni fondamentali su un testo elastico-degenerato, un oggetto computazionale che codifica in maniera compatta un insieme di testi simili tra loro, come, ad esempio, un pan-genoma. Nello specifico, si affrontano il problema di cercare una sequenza di lettere in un testo elastico-degenerato, sia in maniera esatta che tollerando un numero prefissato di errori, e quello di confrontare due testi degenerati. 
Nella seconda parte si considerano sia filogenesi tumorali, che ricostruiscono per l'appunto l'evoluzione di un tumore, sia filogenesi "classiche", che rappresentano, ad esempio, la storia evolutiva delle specie viventi. In particolare, si presentano nuove tecniche per confrontare due o più filogenesi tumorali, necessarie per valutare i risultati di diversi metodi che ricostruiscono le filogenesi stesse, e una nuova e più efficiente soluzione a un problema di lunga data relativo a filogenesi "classiche", consistente nel determinare se sia possibile sistemare, in presenza di dati mancanti, un insieme di specie in un albero filogenetico che abbia determinate proprietà.
The main goal of this thesis is to develop new algorithmic frameworks to deal with (i) a convenient representation of a set of similar genomes and (ii) phylogenetic data, with particular attention to the increasingly accurate tumor phylogenies. A “pan-genome” is, in general, any collection of genomic sequences to be analyzed jointly or to be used as a reference for a population. A phylogeny, in turn, is meant to describe the evolutionary relationships among a group of items, be they species of living beings, genes, natural languages, ancient manuscripts or cancer cells. With the exception of one of the results included in this thesis, related to the analysis of tumor phylogenies, the focus of the whole work is mainly theoretical, the intent being to lay firm algorithmic foundations for the problems by investigating their combinatorial aspects, rather than to provide practical tools for attacking them. Deep theoretical insights on the problems allow a rigorous analysis of existing methods, identifying their strong and weak points, providing details on how they perform and helping to decide which problems need to be further addressed. In addition, it is often the case where new theoretical results (algorithms, data structures and reductions to other well-studied problems) can either be directly applied or adapted to fit the model of a practical problem, or at least they serve as inspiration for developing new practical tools. The first part of this thesis is devoted to methods for handling an elastic-degenerate text, a computational object that compactly encodes a collection of similar texts, like a pan-genome. Specifically, we attack the problem of matching a sequence in an elastic-degenerate text, both exactly and allowing a certain amount of errors, and the problem of comparing two degenerate texts. 
In the second part we consider both tumor phylogenies, describing the evolution of a tumor, and “classical” phylogenies, representing, for instance, the evolutionary history of the living beings. In particular, we present new techniques to compare two or more tumor phylogenies, needed to evaluate the results of different inference methods, and we give a new, efficient solution to a longstanding problem on “classical” phylogenies: to decide whether, in the presence of missing data, it is possible to arrange a set of species in a phylogenetic tree that enjoys specific properties.
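Exact matching in an elastic-degenerate text, discussed in the first part of the abstract, can be sketched with a naive algorithm. This is a sketch for intuition only and is not one of the algorithms from the thesis. It assumes the text is given as a list of segments, each a set of alternative strings (the empty string is allowed), and it reports the 0-based indices of the segments in which an occurrence of the pattern ends; the function name `ed_match` is hypothetical.

```python
def ed_match(text: list[set[str]], pattern: str) -> list[int]:
    """Indices of segments in which some occurrence of `pattern` ends."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    ends: set[int] = set()
    # Lengths p such that pattern[:p] matches a suffix of some string
    # spelled by the text that stops exactly at the previous segment boundary.
    active: set[int] = set()
    for i, segment in enumerate(text):
        new_active: set[int] = set()
        for s in segment:
            if pattern in s:
                ends.add(i)  # occurrence entirely inside one alternative
            for p in active:
                rest = pattern[p:]
                if len(s) >= len(rest):
                    if s.startswith(rest):
                        ends.add(i)  # partial match is completed inside s
                elif rest.startswith(s):
                    new_active.add(p + len(s))  # s is consumed whole
            # Proper prefixes of the pattern that start inside s.
            for length in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:length]):
                    new_active.add(length)
        active = new_active
    return sorted(ends)
```

For the text `[{"A", "C"}, {"GT", ""}, {"T"}]`, the pattern `AT` matches by choosing `A`, then the empty alternative, then `T`, so the occurrence ends in the last segment.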
43

Rausch, Tobias [Verfasser]. "Dissecting multiple sequence alignment methods : the analysis, design and development of generic multiple sequence alignment components in SeqAn / Tobias Rausch." Berlin : Freie Universität Berlin, 2010. http://d-nb.info/1024541460/34.

44

Abu, Doleh Anas. "High Performance and Scalable Matching and Assembly of Biological Sequences." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1469092998.

45

Behr, Jonathan Robert. "Novel tools for sequence and epitope analysis of glycosaminoglycans." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/42383.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Biological Engineering Division, 2007.
Includes bibliographical references.
Our understanding of glycosaminoglycan (GAG) biology has been limited by a lack of sensitive and efficient analytical tools designed to deal with these complex molecules. GAGs are heterogeneous and often sulfated linear polysaccharides found throughout the extracellular environment, and available to researchers only in limited mixtures. A series of sensitive label-free analytical tools were developed to provide sequence information and to quantify whole epitopes from GAG mixtures. Three complementary sets of tools were developed to provide GAG sequence information. Two novel exolytic sulfatases from Flavobacterium heparinum that degrade heparan/heparan sulfate glycosaminoglycans (HSGAGs) were cloned and characterized. These exolytic enzymes enabled the exo-sequencing of a HSGAG oligosaccharide. Phenylboronic acids (PBAs) were specifically reacted with unsulfated chondroitin sulfate (CS) disaccharides from within a larger mixture. The resulting cyclic esters were easily detected in mass spectrometry (MS) using the distinct isotopic abundance of boron. Electrospray ionization tandem mass spectrometry (ESI-MSn) was employed to determine the fragmentation patterns of HSGAG disaccharides. These patterns were used to quantify relative amounts of isomeric disaccharides in a mixture. Fragmentation information is valuable for building methods for oligosaccharide sequencing, and the general method can be applied to quantify any isomers using MSn. Three other tools were developed to quantify GAG epitopes. Two microfluidic devices were characterized as HSGAG sensors. Sensors were functionalized either with protamine to quantify total HSGAGs or with antithrombin-III (AT-III) to quantify a specific anticoagulant epitope.
A charge sensitive silicon field effect sensor accurately quantified clinically relevant anticoagulants including low molecular weight heparins (LMWH), even out of serum. A mass sensitive suspended microchannel resonator (SMR) measured the same clinically relevant HSGAGs. When these two sensors were compared, the SMR proved more robust and versatile. The SMR signal is more stable, it can be reused ad infinitum, and surface modifications can be automated and monitored. The field effect sensor provided an advantage in selectivity by preferentially detecting highly charged HSGAGs instead of any massive, non-specifically bound proteins. Lastly, anti-HSGAG single chain variable fragments (scFv) were evolved using yeast surface display towards generating antibodies for HSGAG epitope sensing and clinical GAG neutralization.
by Jonathan Robert Behr.
Ph.D.
46

Maaskola, Jonas [Verfasser]. "Discriminative Learning for Probabilistic Sequence Analysis / Jonas Maaskola." Berlin : Freie Universität Berlin, 2015. http://d-nb.info/1074139488/34.

47

Shadrin, Alexey [Verfasser]. "Positional Information Storage in Sequence Patterns / Alexey Shadrin." Berlin : Freie Universität Berlin, 2014. http://d-nb.info/1060368056/34.

48

Mamer, Thierry. "A sequence-length sensitive approach to learning biological grammars using inductive logic programming." Thesis, Robert Gordon University, 2011. http://hdl.handle.net/10059/662.

Abstract:
This thesis aims to investigate whether the ideas behind compression principles, such as the Minimum Description Length, can help us to improve the process of learning biological grammars from protein sequences using Inductive Logic Programming (ILP). Contrary to most traditional ILP learning problems, biological sequences often have a high variation in their length. This variation in length is an important feature of biological sequences which should not be ignored by ILP systems. However, we have identified that some ILP systems do not take into account the length of examples when evaluating their proposed hypotheses. During the learning process, many ILP systems use clause evaluation functions to assign a score to induced hypotheses, estimating their quality and effectively influencing the search. Traditionally, clause evaluation functions do not take into account the length of the examples which are covered by the clause. We propose L-modification, a way of modifying existing clause evaluation functions so that they take into account the length of the examples which they learn from. An empirical study was undertaken to investigate whether significant improvements can be achieved by applying L-modification to a standard clause evaluation function. Furthermore, we generally investigated how ILP systems cope with the length of examples in training data. We show that our L-modified clause evaluation function outperforms our benchmark function in every experiment we conducted and thus we prove that L-modification is a useful concept. We also show that the length of the examples in the training data used by ILP systems does have an undeniable impact on the results.
49

Törnkvist, Maria. "Synovial sarcoma : molecular, biological and clinical implications /." Stockholm, 2004. http://diss.kib.ki.se/2004/91-7140-024-9/.

50

Shenoy, Nalini. "Investigation of the replacement of cysteine residues in DOTA-(Tyr³)-octreotate synthesis, characterization and evaluation of biological activities /." Diss., Columbia, Mo. : University of Missouri-Columbia, 2006. http://hdl.handle.net/10355/4440.

Abstract:
Thesis (Ph. D.) University of Missouri-Columbia, 2006.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file (viewed on August 8, 2007). Includes bibliographical references.