Dissertations / Theses: 'Protein sequence alignment'

1

Abhiman, Saraswathi. "Prediction of function shift in protein families." Stockholm, 2006. http://diss.kib.ki.se/2006/91-7140-869-X/.

2

Carroll, Hyrum D. "Biologically Relevant Multiple Sequence Alignment." Diss., 2008. http://contentdm.lib.byu.edu/ETD/image/etd2623.pdf.

3

Talbot, Danielle. "Identifying misalignments in sequence alignment for protein modelling." Thesis, University of Reading, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.445754.

4

Garriga Nogales, Edgar. "New algorithmic contributions for large scale multiple sequence alignments of protein sequences." Doctoral thesis, TDX (Tesis Doctorals en Xarxa), 2022. http://hdl.handle.net/10803/673526.

Abstract:

In these days of significant change and rapid technological evolution, the amount of data that science has to deal with is growing incredibly fast, and the size of the data can be prohibitive. Multiple sequence alignments (MSAs) are used in various areas of biology, and the increase in data has degraded the results of existing methods. For this reason, a new approach to computing MSAs is proposed. This novel paradigm allows the alignment of millions of sequences and makes the process modular. Regressive enables parallelization of the process and the combination of clustering methods (guide trees) with whichever aligner is desired. On the clustering side, the guide tree has to be rethought. A study of the current state of the methods and their strengths and weaknesses has been performed to shed some light on the topic. The guide tree cannot be the bottleneck, and it should provide a good starting point for the aligners.

5

Bonneau, Richard A. "Gene annotation using ab initio protein structure prediction: method development and application to major protein families." Thesis, 2001. http://hdl.handle.net/1773/9241.

6

Lassmann, Timo. "Algorithms for building and evaluating multiple sequence alignments." Stockholm, 2006. http://diss.kib.ki.se/2006/91-7140-887-8/.

7

Hollich, Volker. "Orthology and protein domain architecture evolution." Stockholm, 2006. http://diss.kib.ki.se/2006/91-7140-783-9/.

8

Li, Yuheng. "Searching for remotely homologous sequences in protein databases with hybrid PSI-blast." The Ohio State University, 2006. http://rave.ohiolink.edu/etdc/view?acc_num=osu1164741421.

9

DeBlasio, Dan, and John Kececioglu. "Core column prediction for protein multiple sequence alignments." BIOMED CENTRAL LTD, 2017. http://hdl.handle.net/10150/623957.

Abstract:

Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted. Results: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
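The definition of coreness given at the start of this abstract can be illustrated with a minimal Python sketch. It computes the true coreness of each column when a reference alignment is available; it is not the predictor developed in the paper. The function names, the representation of alignments as lists of equal-length gapped strings, and the convention that a column with fewer than two residues scores 0.0 are all assumptions made for illustration.

```python
from itertools import combinations

def residue_columns(alignment):
    """Map (sequence index, residue index) -> column index in a gapped alignment."""
    mapping = {}
    for s, row in enumerate(alignment):
        pos = 0
        for col, ch in enumerate(row):
            if ch != "-":
                mapping[(s, pos)] = col
                pos += 1
    return mapping

def column_coreness(computed, reference, core_columns):
    """Fraction of residue pairs in each computed column that are aligned
    together in a core column of the reference alignment."""
    ref_col = residue_columns(reference)
    columns = [[] for _ in range(len(computed[0]))]
    for residue, col in residue_columns(computed).items():
        columns[col].append(residue)
    scores = []
    for residues in columns:
        pairs = list(combinations(residues, 2))
        if not pairs:
            scores.append(0.0)  # assumed convention for columns with < 2 residues
            continue
        hits = sum(
            1 for a, b in pairs
            if ref_col[a] == ref_col[b] and ref_col[a] in core_columns
        )
        scores.append(hits / len(pairs))
    return scores

# Toy example: the same three sequences aligned two ways.
reference = ["AC-G", "ACTG", "A-TG"]
computed = ["ACG-", "ACTG", "A-TG"]
print(column_coreness(computed, reference, core_columns={0, 3}))
```

In this toy case the first and last computed columns reproduce core columns of the reference, while the two middle columns contain no pair that the reference aligns in a core column.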

10

Aniba, Mohamed Radhouane. "Knowledge based expert system development in bioinformatics : applied to multiple sequence alignment of protein sequences." Strasbourg, 2010. https://publication-theses.unistra.fr/public/theses_doctorat/2010/ANIBA_Mohamed_Radhouane_2010.pdf.

Abstract:

The objective of this PhD project was the development of an integrated expert system to test, evaluate and optimize all the stages of the construction and analysis of a multiple sequence alignment. The new system was validated using standard benchmark cases and brings a new vision to software development in bioinformatics: knowledge-guided systems. The architecture used to build the expert system is highly modular and flexible, allowing AlexSys to evolve as new algorithms are made available. In the future, AlexSys will be used to further optimize each stage of the alignment process, for example by optimizing the input parameters of the different algorithms. The inference engine could also be extended to identify combinations of algorithms that could potentially provide complementary information about the input sequences. For example, well-aligned regions from different aligners could be identified and combined into a single consensus alignment. Additional structural and functional information could also be exploited to improve the final alignment accuracy. Finally, a crucial aspect of any bioinformatics tool is its accessibility and usability. Therefore, we are currently developing a web server and a web-services-based distributed system. We will also design a novel visualization module that will provide an intuitive, user-friendly interface to all the information retrieved and constructed by AlexSys.

11

Madangopal, Sangeetha. "Comparison of Methods Used for Aligning Protein Sequences." Digital Archive @ GSU, 2006. http://digitalarchive.gsu.edu/cs_theses/30.

Abstract:

Comparing protein sequences is an essential procedure that has many applications in the field of bioinformatics. Recent advances in computational capabilities and algorithm design have simplified the comparison of protein sequences from several databases. Various algorithms have emerged that use state-of-the-art approaches to match protein sequences based on structural and functional properties of the amino acids. The matching involves structural alignment, and this alignment may be global, comprising the whole length of the protein, or local, comprising sub-sequences of the proteins. Families of related proteins are found by clustering sequence alignments. The frequency distributions of the amino acids within these different clusters define the sequence profile. The best alignment algorithm uses these profiles. In this thesis, we have studied different profile alignment algorithms where the cost function for comparing two profiles is changed. These are compared to the FFAS3 (Fold and Function Assignment) algorithm.
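As a rough illustration of what a sequence profile and a profile-comparison cost function can look like, the Python sketch below builds per-column amino acid frequency distributions from a small alignment and scores two profile columns with a dot product. The dot product is only one possible choice, used here as an assumption; it is not claimed to be the cost function studied in the thesis or the one used by FFAS3. The function names are hypothetical.

```python
from collections import Counter

def build_profile(aligned_seqs):
    """Return one frequency distribution (dict) per alignment column.
    Gap characters are ignored; an all-gap column yields an empty dict."""
    profile = []
    for column in zip(*aligned_seqs):
        residues = [c for c in column if c != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()})
    return profile

def dot_product_score(col_a, col_b):
    """Similarity of two profile columns: sum over amino acids of p_a * p_b."""
    return sum(p * col_b.get(aa, 0.0) for aa, p in col_a.items())

family_1 = build_profile(["ACD", "ACE", "A-E"])
family_2 = build_profile(["ACE", "GCE"])
for i, (a, b) in enumerate(zip(family_1, family_2)):
    print(i, dot_product_score(a, b))
```

A profile-profile aligner would plug such a column score into a dynamic programming recurrence; swapping the scoring function while keeping the recurrence fixed is one way to compare cost functions.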

12

Zhao, Zhiyu. "Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence Comparison." ScholarWorks@UNO, 2008. http://scholarworks.uno.edu/td/851.

Abstract:

Sequence analysis and structure analysis are two of the fundamental areas of bioinformatics research. This dissertation discusses, specifically, protein structure related problems including protein structure alignment and query, and genome sequence related problems including haplotype reconstruction and genome rearrangement. It first presents an algorithm for pairwise protein structure alignment that is tested with structures from the Protein Data Bank (PDB). In many cases it outperforms two other well-known algorithms, DaliLite and CE. The preliminary algorithm is a graph-theory based approach, which uses the concept of "stars" to reduce the complexity of clique-finding algorithms. The algorithm is then improved by introducing "double-center stars" in the graph and applying a self-learning strategy. The updated algorithm is tested with a much larger set of protein structures and shown to be an improvement in accuracy, especially in cases of weak similarity. A protein structure query algorithm is designed to search for similar structures in the PDB, using the improved alignment algorithm. It is compared with SSM and shows better performance with lower maximum and average Q-score for missing proteins. An interesting problem dealing with the calculation of the diameter of a 3-D sequence of points arose and its connection to the sublinear time computation is discussed. The diameter calculation of a 3-D sequence is approximated by a series of sublinear time deterministic, zero-error and bounded-error randomized algorithms and we have obtained a series of separations about the power of sublinear time computations. This dissertation also discusses two genome sequence related problems. A probabilistic model is proposed for reconstructing haplotypes from SNP matrices with incomplete and inconsistent errors. The experiments with simulated data show both high accuracy and speed, conforming to the theoretically provable efficiency and accuracy of the algorithm.
Finally, a genome rearrangement problem is studied. The concept of non-breaking similarity is introduced. Approximating the exemplar non-breaking similarity to factor n^(1-ε) is proven to be NP-hard. Interestingly, for several practical cases, polynomial time algorithms are presented.

13

Ohlson, Tomas. "The use of evolutionary information in protein alignments and homology identification." Doctoral thesis, Stockholm : Stockholm Bioinformatics Center, Stockholm University, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-812.

14

Tångrot, Jeanette. "Structural Information and Hidden Markov Models for Biological Sequence Analysis." Doctoral thesis, Umeå universitet, Institutionen för datavetenskap, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-1629.

Abstract:

Bioinformatics is a fast-developing field, which makes use of computational methods to analyse and structure biological data. An important branch of bioinformatics is structure and function prediction of proteins, which is often based on finding relationships to already characterized proteins. It is known that two proteins with very similar sequences also share the same 3D structure. However, there are many proteins with similar structures that have no clear sequence similarity, which makes it difficult to find these relationships. In this thesis, two methods for annotating protein domains are presented, one aiming at assigning the correct domain family or families to a protein sequence, and the other aiming at fold recognition. Both methods use hidden Markov models (HMMs) to find related proteins, and they both exploit the fact that structure is more conserved than sequence, but in two different ways. Most of the research presented in the thesis focuses on the structure-anchored HMMs, saHMMs. For each domain family, an saHMM is constructed from a multiple structure alignment of carefully selected representative domains, the saHMM-members. These saHMM-members are collected in the so-called "midnight ASTRAL set", and are chosen so that all saHMM-members within the same family have mutual sequence identities below a threshold of about 20%. In order to construct the midnight ASTRAL set and the saHMMs, a pipeline of software tools is developed. The saHMMs are shown to be able to detect the correct family relationships at very high accuracy, and perform better than the standard tool Pfam in assigning the correct domain families to new domain sequences. We also introduce the FI-score, which is used to measure the performance of the saHMMs, in order to select the optimal model for each domain family. The saHMMs are made available for searching through the FISH server, and can be used for assigning family relationships to protein sequences.
The other approach presented in the thesis is secondary structure HMMs (ssHMMs). These HMMs are designed to use both the sequence and the predicted secondary structure of a query protein when scoring it against the model. A rigorous benchmark is used, which shows that HMMs made from multiple sequences result in better fold recognition than those based on single sequences. Adding secondary structure information to the HMMs improves the ability of fold recognition further, both when using true and predicted secondary structures for the query sequence.
Bioinformatics is a field in which methods from computer science and statistics are used to analyse and structure biological data. An important area within bioinformatics tries to predict the three-dimensional structure and function of a protein from its amino acid sequence and/or from similarities to other, already characterized, proteins. It is known that two proteins with similar amino acid sequences also have similar three-dimensional structures. However, the fact that two proteins have similar structures does not necessarily mean that their sequences are similar, which can make it difficult to find structural similarities from a protein's amino acid sequence. This thesis describes two methods for finding similarities between proteins, one focusing on determining which family of protein domains, with known 3D structure, a given sequence belongs to, while the other tries to predict a protein's fold, i.e. to give a rough picture of the protein's structure. Both methods use so-called hidden Markov models (HMMs), a statistical method that can be used, among other things, to describe protein families. With the help of an HMM, one can predict whether a given protein sequence belongs to the family the model represents. Both methods also use structural information to increase the models' ability to recognize related sequences, but in different ways. Most of the work in the thesis concerns structure-anchored HMMs (saHMMs). To build the saHMMs, structure-based sequence alignments are used, which are generated from how the protein domains can be superimposed in space, rather than from which amino acids make up their sequences. In each protein family, only a special, representative selection of domains is used. These are chosen so that, when the sequences are compared pairwise, no pair within the family has a sequence identity higher than about 20%. This selection is made to obtain as wide a spread as possible among the sequences within the family.
A suite of software has been developed to select representatives for each family and then build saHMMs based on them. It turns out that the saHMMs can find the right family for a high proportion of the tested sequences, with almost no errors. They are also better than the widely used method Pfam at finding the right family for entirely new protein sequences. The saHMMs are available through the FISH server, which anyone can use via the Internet to find which family a protein of interest may belong to. The other method presented in the thesis is secondary structure HMMs (ssHMMs), which are built from ordinary multiple sequence alignments, but also from information about which secondary structures the protein sequences in the family have. When a protein sequence is compared with the ssHMM, a prediction of its secondary structure is used, and the computed probability that the sequence belongs to the family is based on both the amino acid sequence and the secondary structure. A comparison shows that HMMs based on multiple sequences are better than those based on only a single sequence at finding the right fold for a protein sequence. The HMMs become even better if the secondary structure is also taken into account, both when the true secondary structure is used and when a theoretically predicted one is used.
Jeanette Hargbo.

15

Ng, Pauline Crystal. "PSSMs: not just roadkill on the information superhighway." Thesis, 2002. http://hdl.handle.net/1773/8116.

16

Kemena, Carsten. "Improving the accuracy and the efficiency of multiple sequence alignment methods." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/128678.

Abstract:

Sequence alignment is one of the basic methods for comparing biological sequences and the cornerstone of a wide range of analyses. Because of this privileged position at the beginning of many studies, its accuracy is of great importance; in fact, every result based on an alignment depends on the alignment quality. This has been confirmed in several recent papers investigating the effect of alignment methods on phylogenetic reconstruction and the estimation of positive selection. In this thesis, I present several projects dedicated to the problem of developing more accurate multiple sequence alignments and how to evaluate them. I addressed the problem of structural protein alignment evaluation, the accurate structural alignment of RNA sequences and the alignment of large sequence data sets.

17

Hu, Junbin. "Structural and functional studies on heat shock protein Hsp40-Hdj1 and Golgi ER trafficking protein Get3." Thesis, Birmingham, Ala. : University of Alabama at Birmingham, 2009. https://www.mhsl.uab.edu/dt/2009p/huj.pdf.

18

Johansson, Joakim. "Modifying a Protein-Protein Interaction Identifier with a Topology and Sequence-Order Independent Structural Comparison Method." Thesis, Linköpings universitet, Bioinformatik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-147777.

Abstract:

Using computational methods to identify protein-protein interactions (PPIs) supports experimental techniques by requiring less time and fewer resources. PPIs can be identified through a template-based approach that describes how unstudied proteins interact by aligning a common structural template that exists in both interacting proteins. A pipeline that uses this approach is InterPred, which combines homology modelling and massive template comparison to construct coarse interaction models. These models are reviewed by a machine learning classifier that identifies models that show traits of being true, which can be further refined with a docking technique. However, InterPred depends on complex structural information that might not be available for unstudied proteins, while it is suggested that PPIs depend on the shape and interface of proteins. A method that aligns structures based on interface attributes is InterComp, which uses topology- and sequence-order-independent structural comparison. Implementing this method in InterPred restricts structural information to the interface of proteins, which could lead to the discovery of undetected PPI models. The results showed that the modified pipeline was not comparable based on receiver operating characteristic (ROC) performance. However, the modified pipeline could identify new potential PPIs that were undetected by InterPred.

19

Ozer, Hatice Gulcin. "Residue Associations In Protein Family Alignments." The Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1211570026.

20

Menlove, Kit J. "Model Detection Based upon Amino Acid Properties." BYU ScholarsArchive, 2010. https://scholarsarchive.byu.edu/etd/2253.

Abstract:

Similarity searches are an essential component of most bioinformatic applications. They form the basis of structural motif identification, gene identification, and insights into functional associations. With the rapid increase in the available genetic data through a wide variety of databases, similarity searches are an essential tool for accessing these data in an informative and productive way. In our chapter, we provide an overview of similarity searching approaches, related databases, and parameter options to achieve the best results for a variety of applications. We then provide a worked example and some notes for consideration. Homology detection is one of the most basic and fundamental problems at the heart of bioinformatics. It is central to problems currently under intense investigation in protein structure prediction, phylogenetic analyses, and computational drug development. Currently, discriminative methods for homology detection, which are not readily interpretable, are substantially more powerful than their more interpretable counterparts, particularly when sequence identity is very low. Here I present a computational graph-based framework for homology inference using physiochemical amino acid properties, which aims both to reduce the gap in accuracy between discriminative and generative methods and to provide a framework for easily identifying the physiochemical basis for the structural similarity between proteins. The accuracy of my method slightly improves on the accuracy of PSI-BLAST, the most popular generative approach, and underscores the potential of this methodology given a more robust statistical foundation.

21

Nobili, Alberto. "Improving biocatalysts via semi-rational protein design: use of a multiple sequence alignment platform to reduce screening efforts and facilitate hit identification." Greifswald: Universitätsbibliothek Greifswald, 2016. http://d-nb.info/1113294191/34.

22

Cao, Haibo. "Protein Structure Recognition From Eigenvector Analysis to Structural Threading Method." Washington, D.C. : Oak Ridge, Tenn. : United States. Dept. of Energy. Office of Science ; distributed by the Office of Scientific and Technical Information, U.S. Dept. of Energy, 2003. http://www.osti.gov/servlets/purl/822060-2L2Xvm/native/.

Abstract:

Thesis (Ph.D.); Submitted to Iowa State Univ., Ames, IA (US); 12 Dec 2003.
Published through the Information Bridge: DOE Scientific and Technical Information. "IS-T 2028" Haibo Cao. 12/12/2003. Report is also available in paper and microfiche from NTIS.

23

Gomes, Mireille. "Role of mutual information for predicting contact residues in proteins." Thesis, University of Oxford, 2012. http://ora.ox.ac.uk/objects/uuid:5ec3c90c-73fb-494f-ad2e-efc718406aa4.

Abstract:

Mutual Information (MI) based methods are used to predict contact residues within proteins and between interacting proteins. There have been many high-impact papers citing the successful use of MI for determining contact residues in a particular protein of interest, or in certain types of proteins, such as homotrimers. In this dissertation we have carried out a systematic study to assess if this popularly employed contact prediction tool is useful on a global scale. After testing original MI and leading MI based methods on large, cross-species datasets we found that in general the performance of these methods for predicting contact residues both within (intra-protein) and between proteins (inter-protein) is weak. We observe that all MI variants have a bias towards surface residues, and therefore predict surface residues instead of contact residues. This finding is in contrast to the relatively good performance of i-Patch (Hamer et al. [2010]), a statistical scoring tool for inter-protein contact prediction. i-Patch uses as input surface residues only, groups amino acids by physiochemical properties, and assumes the existence of patches of contact residues on interacting proteins. We examine whether using these ideas would improve the performance of MI. Since inter-protein contact residues are only on the surface of each protein, to disentangle surface from contact prediction we filtered out the confounding buried residues. We observed that considering surface residues only does indeed improve the inter-protein contact prediction ability of all tested MI methods. We examined a specific "successful" case study in the literature and demonstrated that here, even when considering surface residues only, the most accurate MI based inter-protein contact predictor, MIc, performs no better than random.
We have developed two novel MI variants; the first groups amino acids by their physiochemical properties, and the second considers patches of residues on the interacting proteins. In our analyses these new variants highlight the delicate trade-off between signal and noise that must be achieved when using MI for inter-protein contact prediction. The input for all tested MI methods is a multiple sequence alignment of homologous proteins. In a further attempt to understand why the MI methods perform poorly, we have investigated the influence of gaps in the alignment on intra-protein contact prediction. Our results suggest that depending on the evaluation criteria and the alignment construction algorithm employed, a gap cutoff of around 10% would maximise the performance of MI methods, whereas the popularly employed 0% gap cutoff may lead to predictions that are no better than random guesses. Based on the insight we have gained through our analyses, we end this dissertation by identifying a number of ways in which the contact residue prediction ability of MI variants may be improved, including direct coupling analysis.
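For readers unfamiliar with the underlying quantity, the following Python sketch computes plain mutual information between pairs of columns of a multiple sequence alignment, after discarding columns whose gap fraction exceeds a cutoff. It illustrates only the basic (uncorrected) MI score, not MIc or the new variants developed in the thesis. The function names, the choice to skip sequence pairs where either position is a gap, and the reading of the 10% cutoff as a maximum per-column gap fraction are assumptions.

```python
import math
from collections import Counter

def gap_fraction(column):
    """Fraction of gap characters in one alignment column."""
    return column.count("-") / len(column)

def mutual_information(col_i, col_j):
    """MI in bits between two columns, ignoring rows with a gap in either."""
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    marg_i = Counter(a for a, _ in pairs)
    marg_j = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((marg_i[a] / n) * (marg_j[b] / n)))
    return mi

def mi_matrix(alignment, gap_cutoff=0.10):
    """MI for every pair of columns whose gap fraction is <= gap_cutoff."""
    columns = ["".join(col) for col in zip(*alignment)]
    keep = [k for k, col in enumerate(columns) if gap_fraction(col) <= gap_cutoff]
    return {
        (i, j): mutual_information(columns[i], columns[j])
        for idx, i in enumerate(keep)
        for j in keep[idx + 1:]
    }

alignment = ["AG-", "AG-", "CTA", "CTA"]
print(mi_matrix(alignment))
```

High-scoring column pairs are then taken as candidate contacts; setting `gap_cutoff=0.0` corresponds to the stricter filter that the abstract reports may perform no better than random.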

24

Pinheiro, Ana Rita Almeida. "Extracellular enzymes of Botryosphaeriaceae family." Master's thesis, Universidade de Aveiro, 2015. http://hdl.handle.net/10773/17307.

Abstract:

Master's degree in Biochemistry
Species of the Botryosphaeriaceae family are morphologically diverse and are described as endophytes, pathogens and saprophytes. They are commonly found in a wide range of hosts. The plant pathogenic fungi Macrophomina phaseolina, Neofusicoccum parvum and Diplodia corticola secrete a variety of extracellular enzymes, such as proteases and glycoside hydrolases, some of which are involved in host-pathogen interaction. To elucidate the relationship between the microorganism's secretome and its host, the number of sequences encoding extracellular enzymes such as proteases and glycoside hydrolases (xylanases and endoglucanases) was compared between these organisms. Multiple sequence alignment of the protein domains was performed with the bioinformatics tools Clustal X2 and T-Coffee. Furthermore, to study the evolutionary relationship between the protein sequences, phylogenetic trees were constructed using the MEGA tool. Among M. phaseolina, N. parvum and D. corticola, the D. corticola genome contains genes that encode a larger diversity of glycoside hydrolase families, suggesting a better capacity for adaptation during its interaction with host species. The sequence similarity observed in the multiple sequence alignment between M. phaseolina, N. parvum and D. corticola is explained by their evolutionary relationship and not by their host type. The phylogenetic analysis shows that, at the evolutionary level, M. phaseolina and D. corticola are closer to each other than to N. parvum.

25

Ho, Ngai-lam, and 何毅林. "Algorithms on constrained sequence alignment." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B30201949.

26

Cunial, Fabio. "Analysis of the subsequence composition of biosequences." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/44716.

Abstract:

Measuring the amount of information and of shared information in biological strings, as well as relating information to structure, function and evolution, are fundamental computational problems in the post-genomic era. Classical analyses of the information content of biosequences are grounded in Shannon's statistical telecommunication theory, while the recent focus is on suitable specializations of the notions introduced by Kolmogorov, Chaitin and Solomonoff, based on data compression and compositional redundancy. Symmetrically, classical estimates of mutual information based on string editing are currently being supplanted by compositional methods hinged on the distribution of controlled substructures. Current compositional analyses and comparisons of biological strings are almost exclusively limited to short sequences of contiguous solid characters. Comparatively little is known about longer and sparser components, both from the point of view of their effectiveness in measuring information and in separating biological strings from random strings, and from the point of view of their ability to classify and to reconstruct phylogenies. Yet, sparse structures are suspected to grasp long-range correlations and, at short range, they are known to encode signatures and motifs that characterize molecular families. In this thesis, we introduce and study compositional measures based on the repertoire of distinct subsequences of any length, but constrained to occur with a predefined maximum gap between consecutive symbols. Such measures highlight previously unknown laws that relate subsequence abundance to string length and to the allowed gap, across a range of structurally and functionally diverse polypeptides. Measures on subsequences are capable of separating only a few amino acid strings from their random permutations, but they reveal that random permutations themselves amass along previously undetected, linear loci.
This is perhaps the first time in which the vocabulary of all distinct subsequences of a set of structurally and functionally diverse polypeptides is systematically counted and analyzed. Another objective of this thesis is measuring the quality of phylogenies based on the composition of sparse structures. Specifically, we use a set of repetitive gapped patterns, called motifs, whose length and sparsity have never been considered before. We find that extremely sparse motifs in mitochondrial proteomes support phylogenies of comparable quality to state-of-the-art string-based algorithms. Moving from maximal motifs -- motifs that cannot be made more specific without losing support -- to a set of generators with decreasing size and redundancy generally degrades classification, suggesting that redundancy itself is a key factor for the efficient reconstruction of phylogenies. This is perhaps the first time in which the composition of all motifs of a proteome is systematically used in phylogeny reconstruction on a large scale. Extracting all maximal motifs, or even their compact generators, is infeasible for entire genomes. In the last part of this thesis, we study the robustness of measures of similarity built around the dictionary of LZW -- the variant of the LZ78 compression algorithm proposed by Welch -- and of some of its recently introduced gapped variants. These algorithms use a very small vocabulary, they run in time linear in the length of the input strings, and they can be made even faster than LZ77 in practice. We find that dissimilarity measures based on maximal strings in the dictionary of LZW support phylogenies that are comparable to state-of-the-art methods on test proteomes. Introducing a controlled proportion of gaps does not degrade classification, and allows up to 20% of each input proteome to be discarded during comparison.
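
The notion of "distinct subsequences constrained by a maximum gap between consecutive symbols" can be made concrete with a small brute-force sketch. The definition of gap used here (number of skipped characters between two chosen positions), the `max_len` bound, and the function name are assumptions for illustration; the thesis may define and count these objects differently, and this enumeration is only practical for short strings.

```python
def gapped_subsequences(s, max_gap, max_len):
    """Distinct subsequences of `s` with length 1..max_len in which at most
    `max_gap` characters are skipped between consecutive chosen symbols."""
    found = set()

    def extend(last_pos, current):
        found.add(current)
        if len(current) == max_len:
            return
        # The next symbol may sit at most max_gap + 1 positions after the last one.
        upper = min(len(s), last_pos + max_gap + 2)
        for nxt in range(last_pos + 1, upper):
            extend(nxt, current + s[nxt])

    for start in range(len(s)):
        extend(start, s[start])
    return found
```

With `max_gap=0` the result is just the set of distinct substrings up to `max_len`; raising `max_gap` adds the sparser patterns the abstract refers to.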

27

Santos-Ciminera, Patricia Dantas. "Molecular epidemiology of epidemic severe malaria caused by Plasmodium vivax in the state of Amazonas, Brazil /." Download the dissertation in PDF, 2005. http://www.lrc.usuhs.mil/dissertations/pdf/Santos2005.pdf.

28

Tress, Michael. "Towards improving the accuracy of GenTHREADER alignments." Thesis, University of Warwick, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.247983.

29

Midic, Uros. "Genome-Wide Prediction of Intrinsic Disorder; Sequence Alignment of Intrinsically Disordered Proteins." Diss., Temple University Libraries, 2012. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/159800.

Abstract:

Computer and Information Science
Ph.D.
Intrinsic disorder (ID) is defined as a lack of stable tertiary and/or secondary structure under physiological conditions in vitro. Intrinsically disordered proteins (IDPs) are highly abundant in nature. IDPs possess a number of crucial biological functions, being involved in regulation, recognition, signaling and control, i.e. their functional repertoire complements the functions of ordered proteins. Intrinsically disordered regions (IDRs) of IDPs have a different amino-acid composition than structured regions and proteins. This fact has been exploited for development of predictors of ID; the best predictors currently achieve around 80% per-residue accuracy. Earlier studies revealed that some IDPs are associated with various human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, diabetes and others. We developed a methodology for prediction and analysis of abundance of intrinsic disorder on the genome scale, which combines data from various gene and protein databases, and utilizes several ID prediction tools. We used this methodology to perform a large-scale computational analysis of the abundance of (predicted) ID in transcripts of various classes of disease-related genes. We further analyzed the relationships between ID and the occurrence of alternative splicing and Molecular Recognition Features (MoRFs) in human disease classes. An important, never before addressed issue with such genome-wide applications of ID predictors is that, for less-studied organisms, in addition to the experimentally confirmed protein sequences, there is a large number of putative sequences, which have been predicted with automated annotation procedures and lack experimental confirmation. In the human genome, these predicted sequences have significantly higher predicted disorder content.
I investigated the hypothesis that this discrepancy is not genuine, and that it is due to incorrectly annotated parts of the putative protein sequences that exhibit some similarities to confirmed IDRs, which leads to high predicted ID content. I developed a procedure to create synthetic nonsense peptide sequences by translation of non-coding regions of genomic sequences and translation of coding regions with incorrect codon alignment. I further trained several classifiers to discriminate between confirmed sequences and synthetic nonsense sequences, and used these predictors to estimate the abundance of incorrectly annotated regions in putative sequences, as well as to explore the link between such regions and intrinsic disorder. Sequence alignment is an essential tool in modern bioinformatics. Substitution matrices, such as the BLOSUM family, contain 20x20 parameters which are related to the evolutionary rates of amino acid substitutions. I explored various strategies for extension of sequence alignment to utilize the (predicted) disorder/structure information about the sequences being aligned. These strategies employ an extended 40-symbol alphabet which contains 20 symbols for amino acids in ordered regions and 20 symbols for amino acids in IDRs, as well as expanded 40x40 and 40x20 matrices. The new matrices exhibit significant and substantial differences in the substitution scores for IDRs and structured regions. Tests on a reference dataset show that 40x40 matrices perform worse than the standard 20x20 matrices, while 40x20 matrices, used in a scenario where ID is predicted for a query sequence but not for the target sequences, have at least comparable performance. However, I also demonstrate that the variations in performance between 20x20 and 40x20 matrices are insignificant compared to the variation in obtained matrices that occurs when the underlying algorithm for calculation of substitution matrices is changed.
Temple University--Theses
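
One way to picture the extended 40-symbol alphabet described in this abstract is to re-encode a sequence so that each residue carries its (predicted) order/disorder state. The sketch below is only an assumed encoding: uppercase letters for residues in ordered regions, lowercase for residues in IDRs. The function name and the boolean mask input are hypothetical, and the dissertation may represent the alphabet differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode_with_disorder(sequence, disordered):
    """Re-encode `sequence` in a 40-symbol alphabet.

    `disordered` is a per-residue list of booleans (assumed to come from some
    ID predictor). Ordered residues stay uppercase; disordered ones become
    lowercase, giving 20 + 20 distinct symbols.
    """
    if len(sequence) != len(disordered):
        raise ValueError("sequence and disorder mask must have equal length")
    encoded = []
    for residue, is_disordered in zip(sequence.upper(), disordered):
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unsupported residue: {residue!r}")
        encoded.append(residue.lower() if is_disordered else residue)
    return "".join(encoded)
```

A 40x20 matrix in this picture would be indexed by an encoded query symbol (40 possibilities) and a plain target residue (20 possibilities), matching the scenario where disorder is predicted only for the query.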

30

Mokin, Sergey. "Measuring deviation from a deeply conserved consensus in protein multiple sequence alignments." Thesis, McGill University, 2008. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=21956.

Abstract:

Proteins across species show variable degrees of conservation. Different patterns of conservation in the columns of an alignment indicate different evolutionary pressures on sequences. Protein conservation analysis is useful for a wide variety of applications, including disease mutation assessment, pseudogene analysis and functional residue prediction. This study describes a novel measure of column conservation in protein multiple sequence alignments ('MSA'), and the application of this measure to calculate statistical deviation from alignment consensus ('SDAC'). We have assessed SDAC for two case studies of sequences: (a) putative pseudogenes in Mycobacteria, and (b) young lineage-specific retrotransposed sequences in the human and mouse genomes. In the procedure, we rank residue positions for deep conservation, and evaluate statistically significant violations from MSA consensus. The novel conservation measure clearly indicated a variable degree of physicochemical conservation for a given column entropy. That, in turn, enabled us to detect deviations from physicochemical consensus in a protein MSA that are not found by entropy measures.
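
The point that columns with the same entropy can differ in physicochemical conservation is easy to see with the standard Shannon column entropy. The sketch below shows only that baseline entropy measure, not the thesis's novel conservation measure or SDAC; the function name and the `-` gap character are assumptions.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of the residue distribution in one MSA column,
    ignoring gap characters."""
    residues = [r for r in column if r != gap]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())
```

For example, the columns `ILVM` (four hydrophobic residues) and `DKGW` (four chemically dissimilar residues) both have an entropy of 2 bits, so entropy alone cannot tell them apart; a measure that accounts for residue properties can.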

31

Almeida, André Atanasio Maranhão 1981. "Novas abordagens para o problema do alinhamento múltiplo de sequências." [s.n.], 2013. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275646.

Abstract:

Advisor: Zanoni Dias
Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação
Abstract: Sequence alignment is widely recognized as one of the most important tasks in bioinformatics. Its importance stems from its being a basic operation used by many other procedures in the field, such as database searching, visualizing the effect of evolution across a protein family, building phylogenetic trees, and identifying conserved motifs. Sequences can be aligned in pairs, a problem for which an exact algorithm with time complexity O(l^2) is known for sequences of length l. Three or more sequences can also be aligned simultaneously, which is called multiple sequence alignment (MSA). MSA, which is used in tasks such as detecting patterns that characterize protein families and predicting secondary and tertiary protein structure, is an NP-hard problem. In this work, heuristic methods were developed for multiple alignment of protein sequences. The main existing approaches and methods were studied, and a series of implementations and evaluations was carried out. First, 342 multiple aligners were built using the progressive approach. This widely used approach to MSA construction consists of three steps. In the first, a distance matrix is computed. Next, a guide tree is generated from the matrix and, finally, the MSA is built through pairwise alignments whose order is defined by the tree. The aligners developed combine different methods applied to each step. For computing the distance matrices, two methods were developed, which are also able to generate pairwise sequence alignments. One builds the alignment from local alignments and the other uses a logarithmic function for gap penalties. Other methods available in a tool called PHYLIP were also used.
For generating guide trees, the classical UPGMA and Neighbor Joining methods were used, via implementations available in a tool called R. For constructing the multiple alignment, the single-block selection and closest-pair selection methods were implemented. These methods, which select the pair of alignments to be merged in the current cycle, are commonly used for that task. For merging a pair of alignments, 12 methods were implemented, inspired by two commonly used methods: consensus alignment and profile alignment. All possible combinations of these methods were made, resulting in 342 aligners. They were evaluated on the quality of the alignments they generate, and the performance of the methods used in each step was also evaluated. Next, evaluations were carried out in the context of consistency-based alignment. In this approach, the optimal MSA is considered to be the one that agrees with the majority of the optimal alignments for the n(n − 1)/2 pairwise alignments contained in the MSA. Changes were made to a well-known multiple aligner, MUMMALS, which uses this approach. The modifications were made to the k-mer counting method (varying the value of k and the compressed alphabet) and, in a separate evaluation, the initial part of the algorithm was replaced. The methods for computing the distance matrix and generating the guide tree were replaced with others that had performed well in the tests of the progressive approach. In total, 89 variations of the original MUMMALS algorithm were implemented and evaluated and, although MUMMALS already produces high-quality alignments, significant improvements were achieved. The work concluded with the implementation and evaluation of iterative algorithms. These are characterized by their dependence on other aligners to produce initial alignments.
The iterative aligner's task is to refine those alignments through a series of cycles until alignment quality stabilizes. Two non-stochastic iterative aligners were implemented and evaluated, as well as a genetic algorithm (GA) for generating MSAs. In this genetic algorithm, implemented as a parameterizable environment for running genetic algorithms for MSA called ALGAe, several experiments were carried out that progressively raised the quality of the alignments generated. ALGAe incorporates other approaches to multiple alignment construction: block-based, consensus-based and template-based. The first was applied to generating individuals for the initial population. Block-based aligners were implemented using two distinct approaches and, for one of them, five variations were implemented. The second was applied to defining a crossover operator, which uses the M-COFFEE tool to perform consensus-based alignments from individuals of the current GA population, and the third was used to define a fitness function, which uses the PSIPRED tool to predict the secondary structures of the sequences. ALGAe enables a wide variety of new evaluations.
Doctorate
Computer Science
Doctor of Computer Science
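
The first step of the progressive approach described in this abstract, computing a distance matrix, can be sketched with a simple k-mer based distance. This is a generic illustration under stated assumptions, not the method of the thesis or of MUMMALS: the distance formula (one minus the fraction of shared k-mers, normalised by the shorter sequence), the default `k=3`, and all function names are choices made for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Multiset of all overlapping k-mers in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / k-mers in the shorter sequence); 1.0 if too short."""
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        return 1.0
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / denom


def distance_matrix(seqs, k=3):
    """Symmetric matrix of pairwise k-mer distances with a zero diagonal."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = kmer_distance(seqs[i], seqs[j], k)
    return d
```

A matrix like this would then be handed to a clustering method such as UPGMA or Neighbor Joining to produce the guide tree that fixes the order of the pairwise alignments.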

32

Koike, Ryoaro. "Comparison of Protein Sequences and Structures based on the Partition Function Formulation : Probabilistic Alignment." 京都大学 (Kyoto University), 2003. http://hdl.handle.net/2433/148598.

33

Liang, Chengzhi. "COPIA: A New Software for Finding Consensus Patterns in Unaligned Protein Sequences." Thesis, University of Waterloo, 2001. http://hdl.handle.net/10012/1050.

Abstract:

The consensus pattern problem (CPP) aims at finding conserved regions, or motifs, in unaligned sequences. This problem is NP-hard under various scoring schemes. To solve this problem for protein sequences more efficiently, a new scoring scheme and a randomized algorithm based on a substitution matrix are proposed here. Any practical solution to a bioinformatics problem must observe two principles: (1) the problem that it solves accurately describes the real problem; in CPP, this requires that the scoring scheme be able to distinguish a real motif from background; (2) it provides an efficient algorithm to solve the mathematical problem. A key question in protein motif-finding is how to determine the motif length. One problem in EM algorithms to solve CPP is how to find good starting points to reach the global optimum. These two questions were both well addressed under this scoring scheme, which made the randomized algorithm both fast and accurate in practice. A software tool, COPIA (COnsensus Pattern Identification and Analysis), has been developed implementing this algorithm. Experiments using sequences from the von Willebrand factor (vWF) family showed that it worked well on finding multiple motifs and repeats. COPIA's ability to find repeats makes it also useful in illustrating the internal structures of multidomain proteins. Comparative studies using several groups of protein sequences demonstrated that COPIA performed better than the commonly used motif-finding programs.

34

Capella, Gutiérrez Salvador Jesús 1985. "Analysis of multiple protein sequence alignments and phylogenetic trees in the context of phylogenomics studies." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/97289.

Abstract:

Phylogenomics is a biological discipline which can be understood as the intersection of the fields of genomics and evolution. Its main focuses are the analysis of genomes through the evolutionary lens and the understanding of how different organisms relate to each other. Moreover, phylogenomics allows accurate functional annotation of newly sequenced genomes. This discipline has grown rapidly in recent years in response to the deluge of data coming from different genome projects. To achieve its objectives, phylogenomics depends heavily on the accuracy of different methods to generate precise phylogenetic trees. Phylogenetic trees are the basic tool of this field and serve to represent how sequences or species relate to each other through common ancestry. During my thesis, I have centered my efforts on improving an automated pipeline (a set of programs executed in a controlled manner) to generate accurate phylogenetic trees and on their subsequent publication through a public database. Among the efforts to improve the pipeline, I have especially focused on the problem of multiple sequence alignment post-processing, which has been shown to be central to the reliability of subsequent analyses. Subsequently I have applied this pipeline, and a battery of other phylogenomics tools, to the study of the phylogenetic position of Microsporidia, a group of fast-evolving intracellular parasites. Due to their special genomic features, Microsporidia evolution constitutes one of the classical examples of challenging problems for phylogenomics. Finally, I have also used the pipeline as a part of a newly designed method for selecting robust combinations of phylogenetic gene markers. I have used this method for selecting optimal gene sets to assess the phylogenetic relationships within fungi and cyanobacteria, showing that the potential of these genes as phylogenetic markers goes well beyond the species used for their selection.

35

Scheeff, Eric David. "Multiple alignments of protein structures and their application to sequence annotation with hidden Markov models /." Diss., Connect to a 24 p. preview or request complete full text in PDF format. Access restricted to UC campuses, 2003. http://wwwlib.umi.com/cr/ucsd/fullcit?p3112860.

36

Nosek, Ondřej. "Hardwarová akcelerace algoritmu pro hledání podobnosti dvou DNA řetězců." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2007. http://www.nusl.cz/ntk/nusl-236882.

Abstract:

Methods for approximate string matching of sequences are a crucial part of development in bioinformatics. These tasks have very high time complexity, so we want to create a hardware platform to accelerate the computations. The goal of this work is to design a generalized architecture based on FPGA technology that can work with various types of sequences. The designed acceleration card will mainly use dynamic programming algorithms such as Needleman-Wunsch and Smith-Waterman.
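
For readers unfamiliar with the two algorithms named in this abstract, the following software sketch computes the optimal alignment score with the same dynamic programming recurrence: global (Needleman-Wunsch) by default, local (Smith-Waterman) when `local=True`. It is a plain reference version with an assumed linear gap penalty and illustrative scoring values, and says nothing about how the thesis maps the computation onto an FPGA.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score of strings `a` and `b`.

    local=False: Needleman-Wunsch (global). local=True: Smith-Waterman (local).
    Uses two rows of the DP table, so memory is O(len(b)) and time O(len(a)*len(b)).
    """
    # First row: all zeros for local alignment, cumulative gap penalties for global.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal can be computed independently; that property is what hardware implementations of these algorithms typically exploit.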

37

Yáñez, Marissa Elena. "Structural and functional studies of minor pseudopilins from the type 2 secretion system of Vibrio cholerae /." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/8086.

38

Pelikán, Ondřej. "Predikce škodlivosti aminokyselinových mutací s využitím metody MAPP." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236151.

Abstract:

This thesis addresses the prediction of the effect of amino acid substitutions on protein function using the MAPP method. The method requires a multiple sequence alignment and a phylogenetic tree constructed by third-party tools. The main goal of the thesis is to find a combination of suitable tools and parameters for generating the inputs to MAPP, based on an analysis of one massively mutated protein. MAPP is then tested with the chosen combination of tools and parameters on two large independent datasets and compared with other tools for predicting the effect of mutations. In addition, a web interface for MAPP was created. The interface simplifies use of the method, since the user need not install any tools or set any parameters.

39

Lehrach, Wolfgang. "Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments." Thesis, University of Edinburgh, 2010. http://hdl.handle.net/1842/29846.

Abstract:

Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at significantly reduced computational cost. This is due to its additional assumption that the underlying binding sites within the same class of PRMs are similar. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

40

Janda, Jan-Oliver [Verfasser], and Rainer [Akademischer Betreuer] Merkl. "Data mining for important amino acid residues in multiple sequence alignments and protein structures / Jan-Oliver Janda. Betreuer: Rainer Merkl." Regensburg : Universitätsbibliothek Regensburg, 2014. http://d-nb.info/1051132843/34.

41

Janda, Jan-Oliver [Verfasser], and Rainer [Akademischer Betreuer] Merkl. "Data mining for important amino acid residues in multiple sequence alignments and protein structures / Jan-Oliver Janda. Betreuer: Rainer Merkl." Regensburg : Universitätsbibliothek Regensburg, 2014. http://nbn-resolving.de/urn:nbn:de:bvb:355-epub-299076.

42

Durek, Pawel, Christian Schudoma, Wolfram Weckwerth, Joachim Selbig, and Dirk Walther. "Detection and characterization of 3D-signature phosphorylation site motifs and their contribution towards improved phosphorylation site prediction in proteins." Universität Potsdam, 2009. http://opus.kobv.de/ubp/volltexte/2010/4512/.

Abstract:

Background: Phosphorylation of proteins plays a crucial role in the regulation and activation of metabolic and signaling pathways and constitutes an important target for pharmaceutical intervention. Central to the phosphorylation process is the recognition of specific target sites by protein kinases followed by the covalent attachment of phosphate groups to the amino acids serine, threonine, or tyrosine. The experimental identification as well as computational prediction of phosphorylation sites (P-sites) has proved to be a challenging problem. Computational methods have focused primarily on extracting predictive features from the local, one-dimensional sequence information surrounding phosphorylation sites. Results: We characterized the spatial context of phosphorylation sites and assessed its usability for improved phosphorylation site predictions. We identified 750 non-redundant, experimentally verified sites with three-dimensional (3D) structural information available in the protein data bank (PDB) and grouped them according to their respective kinase family. We studied the spatial distribution of amino acids around phosphorserines, phosphothreonines, and phosphotyrosines to extract signature 3D-profiles. Characteristic spatial distributions of amino acid residue types around phosphorylation sites were indeed discernable, especially when kinase-family-specific target sites were analyzed. To test the added value of using spatial information for the computational prediction of phosphorylation sites, Support Vector Machines were applied using both sequence as well as structural information. When compared to sequence-only based prediction methods, a small but consistent performance improvement was obtained when the prediction was informed by 3D-context information. 
Conclusion: While local one-dimensional amino acid sequence information was observed to harbor most of the discriminatory power, spatial context information was identified as relevant for the recognition of kinases and their cognate target sites and can be used for an improved prediction of phosphorylation sites. A web-based service (Phos3D) implementing the developed structure-based P-site prediction method has been made available at http://phos3d.mpimp-golm.mpg.de.

43

Simms, Amy Nicole. "Examination of Neisseria gonorrhoeae opacity protein expression during experimental murine genital tract infection /." Download the dissertation in PDF, 2005. http://www.lrc.usuhs.mil/dissertations/pdf/Simms2005.pdf.

44

Grigolon, Silvia. "Modelling and inference for biological systems : from auxin dynamics in plants to protein sequences." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112178/document.

Abstract:

All biological systems are made up of atoms and molecules that interact and give rise to subtle and complex properties. Through these interactions, living organisms can sustain all their vital functions. These properties appear in all biological systems at different levels, from molecules and genes up to cells and tissues. In recent years, physicists have become involved in understanding these particularly intriguing aspects, notably by studying living systems within the framework of network theory, which offers very powerful analytical tools. Two classes of approaches can now be identified for studying such complex systems: direct modelling methods and inverse inference approaches. In this thesis, my work draws on both types of approach, applied at three levels of biological systems. In the first part of the thesis, I focus on the early stages of the development of plant tissues. I propose a new model for understanding the collective dynamics of the transporters of the hormone auxin, which allows non-homogeneous tissue growth in space and time. In the second part, I analyse how evolution constrains the sequence diversity of proteins while preserving their function in different organisms. In particular, I propose a new method for inferring the sites essential to protein function or structure from a set of biological sequences. Finally, in the third part, I work at the cellular level and study the signalling networks associated with auxin. In this context, I reformulate a pre-existing model and propose a new technique for defining and studying the system's response to external signals for different network topologies. I use this theoretical framework to identify the functional role of different topologies in these systems.
All biological systems are made of atoms and molecules interacting in a non-trivial manner. Such non-trivial interactions induce complex behaviours allowing organisms to fulfill all their vital functions. These features can be found in all biological systems at different levels, from molecules and genes up to cells and tissues. In the past few decades, physicists have been paying much attention to these intriguing aspects by framing them in network approaches for which a number of theoretical methods offer many powerful ways to tackle systemic problems. At least two different ways of approaching these challenges may be considered: direct modeling methods and approaches based on inverse methods. In the context of this thesis, we made use of both methods to study three different problems occurring on three different biological scales. In the first part of the thesis, we mainly deal with the very early stages of tissue development in plants. We propose a model aimed at understanding which features drive the spontaneous collective behaviour in space and time of PINs, the transporters which pump the phytohormone auxin out of cells. In the second part of the thesis, we focus instead on the structural properties of proteins. In particular we ask how conservation of protein function across different organisms constrains the evolution of protein sequences and their diversity. Hereby we propose a new method to extract the sequence positions most relevant for protein function. Finally, in the third part, we study intracellular molecular networks that implement auxin signaling in plants. In this context, and using extensions of a previously published model, we examine how network structure affects network function. The comparison of different network topologies provides insights into the role of different modules and of a negative feedback loop in particular. Our introduction of the dynamical response function allows us to characterize the systemic properties of the auxin signaling when external stimuli are applied.

45

Chrysostomou, Charalambos. "Characterisation and classification of protein sequences by using enhanced amino acid indices and signal processing-based methods." Thesis, De Montfort University, 2013. http://hdl.handle.net/2086/9895.

Abstract:

Protein sequencing has produced an overwhelming number of protein sequences, especially in the last decade. Nevertheless, the majority of the proteins' functional and structural classes are still unknown, and the experimental methods currently used to determine these properties are very expensive, laborious and time consuming. Therefore, automated computational methods are urgently required to accurately and reliably predict the functional and structural classes of proteins. Several bioinformatics methods have been developed to determine such properties directly from sequence information. Methods that involve signal processing have recently become popular in bioinformatics; they have been investigated for the analysis of DNA and protein sequences and shown to help characterise the sequences better. However, various technical issues need to be addressed in order to overcome problems associated with signal processing methods for the analysis of protein sequences. Amino acid indices, which are used to transform protein sequences into signals, have various applications and can represent diverse features of protein sequences and amino acids. As the majority of indices have similar features, this project proposes a new set of computationally derived indices that better represent the original group of indices. A study is also carried out that resulted in finding a unique and universal set of best discriminating amino acid indices for the characterisation of allergenic proteins. This analysis extracts features directly from the protein sequences by using the Discrete Fourier Transform (DFT) to build a classification model based on Support Vector Machines (SVM) for the allergenic proteins. The proposed predictive model yields a higher and more reliable accuracy than those of the existing methods. A new method is proposed for performing a multiple sequence alignment. For this method, a DFT-based method is used to construct a new distance matrix in combination with multiple amino acid indices that were used to encode protein sequences into numerical sequences. Additionally, a new type of substitution matrix is proposed in which the physicochemical similarities between any given amino acids are calculated. These similarities were calculated based on the 25 amino acid indices selected, where each one represents a unique biological protein feature. The proposed multiple sequence alignment method yields a better and more reliable alignment than the existing methods. In order to evaluate the complex information that is generated as a result of the DFT, Complex Informational Spectrum Analysis (CISA) is developed and presented. As the results show, when protein classes present similarities or differences according to the Common Frequency Peak (CFP) in specific amino acid indices, then it is probable that these classes are related to the protein feature that the specific amino acid index represents. Using only the absolute spectrum in the informational spectrum analysis of protein sequences is shown to be insufficient, as biologically related features can appear individually in either the real or the imaginary spectrum. This is successfully demonstrated over the analysis of influenza neuraminidase protein sequences. Upon identification of a new protein, it is important to single out the amino acids responsible for the structural and functional classification of the protein, as well as the amino acids contributing to the protein's specific biological characterisation. In this work, a novel approach is presented to identify and quantify the relationship between individual amino acids and the protein. This is successfully demonstrated over the analysis of influenza neuraminidase protein sequences. The characterisation and identification problem of the Influenza A virus protein sequences is tackled through a Subgroup Discovery (SD) algorithm, which can provide ancillary knowledge to the experts. The main objective of the case study was to derive interpretable knowledge for the influenza A virus problem and consequently to better describe the relationships between subtypes of this virus. Finally, by using DFT-based sequence-driven features, a Support Vector Machine (SVM)-based classification model was built and tested that yields higher predictive accuracy than that of SD. The methods developed and presented in this study yield promising results and can be easily applied to proteomic fields.
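
The encoding step this abstract describes (mapping a protein sequence to numbers through an amino acid index, then taking its DFT and inspecting the real, imaginary and absolute spectra) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the Kyte-Doolittle hydropathy scale is an assumed stand-in for whichever indices the author selected, and the function names `encode`, `dft` and `spectra` are hypothetical.

```python
import cmath

# Assumption: Kyte-Doolittle hydropathy values stand in for the thesis's
# (unspecified here) amino acid indices. Any numeric index could be swapped in.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, index=HYDROPATHY):
    """Turn a protein sequence into a numerical signal via an amino acid index."""
    return [index[residue] for residue in sequence.upper()]


def dft(signal):
    """Plain O(n^2) discrete Fourier transform of a real-valued signal."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def spectra(sequence):
    """Return real, imaginary and absolute spectra of the encoded sequence.

    The mean is subtracted first so the zero-frequency term does not dominate,
    and only frequencies 1..n//2 are kept because the input signal is real.
    """
    signal = encode(sequence)
    mean = sum(signal) / len(signal)
    coeffs = dft([x - mean for x in signal])
    half = coeffs[1 : len(coeffs) // 2 + 1]
    return {
        "real": [c.real for c in half],
        "imag": [c.imag for c in half],
        "absolute": [abs(c) for c in half],
    }


if __name__ == "__main__":
    result = spectra("MKTAYIAKQR")  # made-up example sequence
    for name, values in result.items():
        print(name, [round(v, 3) for v in values])
```

Keeping the real and imaginary parts alongside the absolute value mirrors the abstract's point that a feature may show up in only one of the two components; how peaks are then compared across sequences is left out of this sketch.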

46

Hatherley, Rowan. "Structural bioinformatics studies and tool development related to drug discovery." Thesis, Rhodes University, 2016. http://hdl.handle.net/10962/d1020021.

Abstract:

This thesis is divided into two distinct sections which can be combined under the broad umbrella of structural bioinformatics studies related to drug discovery. The first section involves the establishment of an online South African natural products database. Natural products (NPs) are chemical entities synthesised in nature and are unrivalled in their structural complexity, chemical diversity, and biological specificity, which has long made them crucial to the drug discovery process. South Africa is rich in both plant and marine biodiversity and a great deal of research has gone into isolating compounds from organisms found in this country. However, there is no official database containing this information, making it difficult to access for research purposes. This information was extracted manually from literature to create a database of South African natural products. In order to make the information accessible to the general research community, a website, named “SANCDB”, was built to enable compounds to be quickly and easily searched for and downloaded in a number of different chemical formats. The content of the database was assessed and compared to other established natural product databases. Currently, SANCDB is the only database of natural products in Africa with an online interface. The second section of the thesis was aimed at performing structural characterisation of proteins with the potential to be targeted for antimalarial drug therapy. This looked specifically at 1) The interactions between an exported heat shock protein (Hsp) from Plasmodium falciparum (P. falciparum), PfHsp70-x and various host and exported parasite J proteins, as well as 2) The interface between PfHsp90 and the heat shock organising protein (PfHop). The PfHsp70-x:J protein study provided additional insight into how these two proteins potentially interact. 
Analysis of the PfHsp90:PfHop complex also provided structural insight into the interaction interface between these two proteins and identified residues that could be targeted because of their contribution to the stability of the Hsp90:Hop binding complex and the differences between parasite and human proteins. These studies inspired the development of a homology modelling tool, which can be used to assist researchers with homology modelling while providing them with step-by-step control over the entire process. This thesis presents the establishment of a South African NP database and the development of a homology modelling tool, inspired by protein structural studies. When combined, these two applications have the potential to contribute greatly towards in silico drug discovery research.

47

Khan, Abdul Kareem. "Electrostatic analysis the Ras active site." Doctoral thesis, Universitat Pompeu Fabra, 2009. http://hdl.handle.net/10803/7161.

Abstract:

The electrostatic preorganization of the active site has been postulated as the generic mechanism of enzyme action. Thus, some "strategic" residues would be positioned to catalyze reactions by interacting more strongly with the transition state, thereby lowering the activation energy g cat. It has been proposed that this electrostatic preorientation should be demonstrable by analyzing the electrostatic stability of individual residues in the active site.
Ras is an essential signaling protein and acts as a cellular switch. The structural features of Ras in its active state (ON) differ from those in its inactive state (OFF). In this thesis, an exhaustive analysis of the stability of the residues of the Ras active site is carried out in the active and inactive states.
The electrostatic preorganization of the active site has been put forward as the general framework of action of enzymes. Thus, enzymes would position "strategic" residues in such a way to be prepared to catalyze reactions by interacting in a stronger way with the transition state, in this way decreasing the activation energy g cat for the catalytic process. It has been proposed that such electrostatic preorientation should be shown by analyzing the electrostatic stability of individual residues in the active site.
Ras protein is an essential signaling molecule and functions as a switch in the cell. The structural features of the Ras protein in its active state (ON state) are different than those in its inactive state (OFF state). In this thesis, an exhaustive analysis of the stability of residues in the active and inactive Ras active site is performed.

48

Chi, Yang, and 楊奇. "Use Sequence-Structure Alignment Approach to Predict Protein Function." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/14206718616440163154.

Abstract:

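Master's thesis
Chung Hua University (中華大學)
Master's Program, Department of Computer Science and Information Engineering
Academic year 93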
Protein interactions play an important role in most living organisms. "Guilt by association" is a commonly used method for inferring protein function: if the function of one protein in an interacting pair is known, the other can be inferred to have a closely related function. Protein interactions fall into three functional classes: metabolic or signalling pathways, pattern-formation pathways, and the assembly of biological macromolecular structures. Even the last of these, macromolecular structure assembly, is very important knowledge. There is no doubt that a protein's function is closely related to its molecular structure, yet the functions of about 40 percent of the proteins in human protein data remain unknown. This motivated the use of sequence and structure to predict protein function. Although many methods exist to predict protein secondary and tertiary structure, few take molecular structure into account when predicting function. The open question is whether considering both structure and sequence makes function prediction more effective, which motivates the search for a more reliable prediction method. In this thesis, we attempt to use protein sequence and structural characteristics, deriving the secondary structure from the primary sequence, to predict the function of an unknown protein. The sample data were taken from known proteins on the PDB (Protein Data Bank) website, a well-known biochemical resource, and were processed by gathering, sorting, pruning, training, and predicting. We use a hidden Markov model (HMM) to learn from the primary and secondary structure sequences and to predict protein function. We expect to attain 50% prediction accuracy on the known protein data and hope to contribute to the development of bioinformatics.

49

Ma, Fangrui. "Biological sequence analyses theory, algorithms, and applications /." 2009. http://proquest.umi.com/pqdweb?did=1821098721&sid=1&Fmt=2&clientId=14215&RQT=309&VName=PQD.

Abstract:

Thesis (Ph.D.)--University of Nebraska-Lincoln, 2009.
Title from title screen (site viewed October 13, 2009). PDF text: xv, 233 p. : ill. ; 4 Mb. UMI publication number: AAT 3360173. Includes bibliographical references. Also available in microfilm and microfiche formats.

50

Ho, Cheng Chen, and 何誠禎. "Using Evolutionary Computation to Solve the Multiple Protein Sequence Alignment." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/63510373306799359361.

Abstract:

Master's thesis
Shu-Te University (樹德科技大學)
Graduate Institute of Information Management
Academic year 91
Multiple sequence alignment (MSA) has been an important problem in molecular biology in recent years. The purpose of molecular sequence alignment is to reveal the structural diversity of DNA and protein sequences, and MSA is the most common and important technique for aligning the molecular sequences of organisms. In this thesis, we combine a genetic algorithm with dynamic programming to solve the MSA problem, using two crossover operators and three mutation operators to improve the alignment. Experimental results on real sequences from BAliBASE are given to illustrate the effectiveness of the proposed approach.
