Dissertations / Theses: 'Bioinformatic methods development'

1

Rossini, Roberto. "Development and validation of bioinformatic methods for GRC assembly and annotation." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-414739.

Abstract:

This thesis presents the work done during my master's degree project under the supervision of Alexander Suh and Francisco J. Ruiz-Ruano. My work focused on the development of in silico methods to improve the assembly of the Germline Restricted Chromosome (GRC) of songbirds, more specifically that of the zebra finch. GRCs are a good example of the popular saying "The exception that proves the rule". For a very long time, it was assumed that every cell in a healthy multicellular organism carries the same genetic information. Cytogenetic evidence dating back as far as the early 20th century suggests that this is not always the case: certain organisms have been documented to carry supernumerary B chromosomes, which are dispensable chromosomes that are not part of the normal karyotype of a species. GRCs are often regarded as a special case of B chromosomes, where every individual of a species carries an additional chromosome whose presence is restricted to germline cells. The presence of GRCs has been documented in insects, hagfishes and songbirds. A peculiar case is that of the zebra finch, whose GRC has an estimated size of over 150 Mb, accounting for over 10% of the total zebra finch genome size. Although the first cytogenetic evidence of the zebra finch GRC dates back to 1998, it was only last year that the first comprehensive genomic study of this relatively large chromosome was published. This study shed some light on the gene content of the zebra finch GRC, revealing that it mostly consists of paralogs of A chromosomal genes. The GRC assembly and annotation published as part of this study included 115 GRC-linked genes that were identified through germline/soma read mapping, as well as 36 manually curated scaffolds with a median length of 3.6 kb. Considering the conspicuous size of the zebra finch GRC, it is clear that this is a very fragmented and likely incomplete GRC assembly.
There are many factors that can have a negative impact on assembly completeness and contiguity. In the GRC case, these factors collectively affect coverage in ways that are not properly handled by available genome assemblers. In the course of my master's degree project I developed kFish, a bioinformatic tool that performs alignment-free enrichment of GRC-linked barcodes from a 10x Genomics linked-read DNA Chromium library. kFish uses an iterative approach in which the k-mer content of a set of GRC-linked sequences is compared with that of the reads corresponding to each individual 10x Genomics barcode. This comparison allows kFish to identify likely GRC-linked barcodes, and then to use only the reads corresponding to these barcodes when trying to assemble the GRC. First benchmarking results, generated using five GRC-linked genes from the zebra finch as reference sequences, show that kFish is capable of assembling with high confidence not only already known GRC-linked sequences but also new ones. kFish can do all of this in a matter of hours, using only a few gigabytes of system memory, while previous efforts took over two days to assemble the zebra finch genome and identify GRC-linked scaffolds using an approach based on read mapping. High-quality genome assemblies and annotations are the foundations of modern genomics research, and their absence greatly limits the breadth of the questions that can be answered. There is still a lot that we do not understand about GRCs, and part of this is due to the lack of high-quality GRC assemblies and annotations. Producing such an assembly will likely require an integrated approach, in which multiple sequencing technologies and bleeding-edge bioinformatic tools such as kFish are combined to produce a high-quality assembly, which will be crucial to unravel the mystery of GRC function and evolutionary history.
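
The k-mer comparison described in this abstract can be illustrated with a rough Python sketch. It is not kFish's implementation: the function names, the toy reads, the choice of k = 5 and the 0.5 threshold are all assumptions made for the example, and it performs a single pass rather than the iterative procedure the thesis describes.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq (forward strand only, for simplicity)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score_barcodes(reads_by_barcode, reference_seqs, k=5):
    """For each barcode, compute the fraction of its k-mers found in the references."""
    ref_kmers = set()
    for ref in reference_seqs:
        ref_kmers |= kmers(ref, k)

    scores = {}
    for barcode, reads in reads_by_barcode.items():
        bc_kmers = set()
        for read in reads:
            bc_kmers |= kmers(read, k)
        # Barcodes whose reads are too short to yield any k-mer score 0.
        scores[barcode] = len(bc_kmers & ref_kmers) / len(bc_kmers) if bc_kmers else 0.0
    return scores


def select_barcodes(scores, threshold=0.5):
    """Keep barcodes whose k-mer overlap reaches the (hypothetical) threshold."""
    return sorted(b for b, s in scores.items() if s >= threshold)


# Toy data: one reference sequence and two barcodes (invented for illustration).
reference = ["ACGTACGTGGA"]
reads_by_barcode = {
    "BC1": ["ACGTACGT", "CGTGGA"],    # reads drawn from the reference
    "BC2": ["TTTTTTTT", "CCCCCCCC"],  # unrelated reads
}
scores = score_barcodes(reads_by_barcode, reference, k=5)
selected = select_barcodes(scores)
print(scores, selected)
```

In an iterative variant, the reads of the selected barcodes would be assembled and the resulting sequences added to the reference set before scoring again; that loop is omitted here.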

2

Ruiz, Arenas Carlos 1990. "Methods and bioinformatic tools to study polymorphic inversions in complex diseases." Doctoral thesis, Universitat Pompeu Fabra, 2019. http://hdl.handle.net/10803/666582.

Abstract:

Chromosomal inversions are structural variants in which a segment of DNA changes its orientation. Chromosomal inversions reduce homologous recombination, producing different haplotypes in standard and inverted chromosomes. As a result, they influence adaptation and selection and play a role in susceptibility to human diseases. Inversions can be studied using experimental and bioinformatic methods. SNP array data can be used to call inversion genotypes by exploiting haplotype differences between inverted and standard chromosomes. However, these methods are not optimized for large cohorts (thousands of individuals from existing databases such as dbGaP or UK Biobank). Moreover, current methods can only genotype inversions with two haplotypes, and the inversion calls are difficult to harmonize among cohorts. Finally, it is recognized that chromosomal inversions affect gene expression and DNA methylation, yet there are no accurate methods to globally assess the effect of inversions on local gene expression or DNA methylation. The main aim of this thesis is to develop new robust and scalable methods and bioinformatic tools to study the phenotypic and functional effects of chromosomal inversions, overcoming the existing limitations. To this end, I have developed a new method to genotype chromosomal inversions that can be used in large cohorts, handles inversions with multiple haplotypes, and uses reference haplotypes, allowing the joint analysis of multiple cohorts. Second, I have implemented a multivariate method based on redundancy analysis to study the effects of chromosomal inversions on local DNA methylation and gene expression. I then applied both methods to study the role of chromosomal inversions in two groups of complex diseases: neurodevelopmental disorders and cancer. Finally, I developed a new method to study how chromosomal inversions affect recombination patterns. This method is applicable to any genomic region containing subpopulations with different recombination patterns, making it possible to associate these subpopulations with phenotypic traits.

3

Mastick, Kellen J. "Identification of candidate genes involved in fin/limb development and evolution using bioinformatic methods." Thesis, University of South Dakota, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1566765.

Abstract:

Key to understanding the transition that vertebrates made from water to land is determining the developmental and genomic bases for the changes. New bioinformatic tools provide an opportunity to automate the discovery, broaden the number of, and provide an evidence-based ranking for potential candidate genes. I sought to explore this potential for the fin/limb transition, using the substantial genetic and phenotypic data available in model organism databases. Model organism data was used to hypothesize candidate genes for the fin/limb transition. In addition, 131 fin/limb candidate genes from the literature were extracted and used as a basis for comparison with candidates from the model organism databases. Additionally, seven genes specific to limb and 24 genes specific to fin were identified as future fin/limb transition candidates.

4

Zierep, Paul [Verfasser], and Stefan [Akademischer Betreuer] Günther. "Development of bioinformatic methods for the prediction and understanding of biosynthesis and activity of natural products." Freiburg : Universität, 2020. http://d-nb.info/1231711752/34.

5

Besnier, Francois. "Development of Variance Component Methods for Genetic Dissection of Complex Traits." Doctoral thesis, Uppsala universitet, Centrum för bioinformatik, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-101399.

Abstract:

This thesis presents several developments of the variance component (VC) approach to quantitative trait locus (QTL) mapping. The first part consists of methodological improvements: a new fast and efficient method for estimating identity-by-descent (IBD) matrices has been developed. The new method makes better use of computer resources in terms of computational power and storage memory, and facilitates further improvements by resolving methodological bottlenecks in algorithms that scan for multiple QTL. A new VC model has also been developed in order to consider and evaluate the correlation of the allelic effects within parental line origin in experimental outbred crosses. The method was tested on simulated and experimental data and showed a higher or similar power to detect QTL compared with linear regression based QTL mapping. The second part focused on the prospect of analyzing multigenerational pedigrees with the VC approach. The IBD estimation algorithm was extended to include haplotype information in addition to genotype and pedigree to improve the accuracy of the IBD estimates, and a new haplotyping algorithm was developed to limit the risk of haplotyping errors in multigenerational pedigrees. These newly developed methods were subsequently applied to the analysis of a nine-generation advanced intercross line (AIL) pedigree obtained by crossing two chicken lines divergently selected for body weight. Nine QTL described in an F2 population were replicated in the AIL pedigree, and our strategy of using both genotype and phenotype information from all individuals in the entire pedigree clearly made efficient use of the genotype information available in the AIL.

6

Jauhiainen, Alexandra. "Evaluation and Development of Methods for Identification of Biochemical Networks." Thesis, Linköping University, The Department of Physics, Chemistry and Biology, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2811.

Abstract:

Systems biology is an area concerned with understanding biology on a systems level, where the structure and dynamics of the system are in focus. Knowledge about the structure and dynamics of biological systems is fundamental information about cells and the interactions within cells, and it also plays an increasingly important role in medical applications.

System identification deals with the problem of constructing a model of a system from data, and an extensive theory exists, particularly for the identification of linear systems.

This is a master's thesis in systems biology treating the identification of biochemical systems. Methods based on both local parameter perturbation data and time series data have been tested and evaluated in silico.

The advantage of local parameter perturbation data methods proved to be that they demand less complex data, but the drawbacks are the reduced information content of this data and sensitivity to noise. Methods employing time series data are generally more robust to noise but the lack of available data limits the use of these methods.

The work has been conducted at the Fraunhofer-Chalmers Research Centre for Industrial Mathematics in Göteborg, and at the division of Computational Biology at the Department of Physics and Measurement Technology, Biology, and Chemistry at Linköping University during the autumn of 2004.

7

Hedberg, Lilia. "Identification of obesity-associated SNPs in the human genome : Method development and implementation for SOLiD sequencing data analysis." Thesis, Linköpings universitet, Institutionen för klinisk och experimentell medicin, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-57932.

Abstract:

Over the last few years, genome-wide association studies (GWAS) have been used to identify numerous obesity-associated SNPs in the human genome. By using linkage studies, candidate obesity genes have been identified. When SNPs in the first intron of FTO were found to be associated with BMI, it became the first gene to be linked to common obesity. In order to look for causative explanations behind the associated SNPs, FTO was re-sequenced on the SOLiD sequencing platform. An in-house candidate gene, SLCX, was also sequenced in order to evaluate a potential obesity association. The purpose of this project was to analyse the sequences and also to evaluate the quality of the SOLiD sequencing. Part of the project consisted of performing PCRs and selecting genomic regions for future sequencing projects. I developed and implemented a sequence analysis strategy to identify obesity-associated SNPs. I found 39 obesity-linked SNPs in FTO, a majority of which were located in introns 1 and 8. I also identified 3 associated intronic SNPs in SLCX. I found that the SOLiD sequencing coverage varies between non-repetitive and repetitive genomic regions, and that it is highest near amplicon ends. Interestingly, coverage varies significantly between different amplicons even after repetitive sequences have been removed, which indicates that it is affected by features inherent to the sequence. Still, the observed allele frequencies for known SNPs were highly correlated with the SNP frequencies documented in HapMap. In conclusion, I verify that SNPs in FTO are associated with obesity and also identify a previously unassociated gene, SLCX, as a potential obesity gene. Re-sequencing of genomic regions on the SOLiD platform proved successful for SNP identification, although the differences in sequencing coverage might be problematic.

8

Li, Miaoxin, and 李淼新. "Development of a bioinformatics and statistical framework to integrate biological resources for genome-wide genetic mapping and its applications." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2009. http://hub.hku.hk/bib/B43572030.

9

Patel, Hitesh [Verfasser], and Irmgard [Akademischer Betreuer] Merfort. "Use and development of chem-bioinformatics tools and methods for drug discovery and target identification." Freiburg : Universität, 2015. http://d-nb.info/1115495917/34.

10

Pennington, Steven. "Pulsed induction, a method to identify genetic regulators of determination events." Thesis, Oklahoma State University, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3727701.

Abstract:

Determination is the process in which a stem cell commits to differentiation. How a cell goes through determination is not well understood. Determination is important for proper regulation of cell turnover in tissue and for maintaining the adult stem cell population. Deregulation of determination or differentiation can lead to diseases such as several forms of cancer. In this study I use microarrays to identify candidate genes involved in determination by pulsed induction of mouse erythroleukemia (MEL) cells with DMSO, looking at gene expression changes as the cells go through the early stages of erythropoiesis. The pulsed induction method I have developed to identify candidate genes is to induce cells for a short time (30 min, 2 hours, etc.) and then allow them to grow for the duration of their differentiation time (8 days). For reference, cells were also harvested at the time when the inducer was removed from the media. Results show high numbers of differentially expressed genes, including erythropoiesis-specific genes such as GATA1, globin genes and many novel candidate genes that have also been indicated as playing a role in the dynamic early signaling of erythropoiesis. In addition, several genes showed a pendulum effect when allowed to recover, making them interesting candidate genes for maintaining self-renewal of the adult stem cell population.

11

Castro-Mondragon, Jaime. "Development of bioinformatics methods for the analysis of large collections of transcription factor binding motifs : positional motif enrichment and motif clustering." Thesis, Aix-Marseille, 2017. http://www.theses.fr/2017AIXM0171.

Abstract:

Transcription factors (TFs) are DNA-binding proteins that control gene expression. TF binding motifs (TFBMs, simply called "motifs") are usually represented as position-specific scoring matrices (PSSMs), which can be visualized as sequence logos. Motif analysis is routinely used to discover candidate factors for the regulation of a set of sequences of interest. The advent of high-throughput methods has allowed the detection of thousands of motifs, which are usually stored in databases. In this work I developed two novel methods and implemented software tools to handle large collections of motifs in order to extract interpretable information from high-throughput data: (i) matrix-clustering groups motifs by similarity and offers a dynamic interface; (ii) position-scan detects TFBMs with positional preferences relative to a given reference location (e.g. ChIP-seq peaks, transcription start sites). The methods I developed have been evaluated on control cases and applied to extract meaningful information from different datasets from Drosophila melanogaster and Homo sapiens. The results show that these methods make it possible to analyse motifs in high-throughput datasets and can be integrated into motif analysis workflows.
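
For readers unfamiliar with PSSMs, the following Python sketch shows one common way such a matrix can be built from aligned binding sites and used to scan a sequence. It is a generic illustration, not the matrix-clustering or position-scan code from the thesis; the toy sites, the pseudocount of 1 and the uniform background of 0.25 are assumptions.

```python
import math

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per column) from equal-length aligned sites."""
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pssm


def score_window(pssm, window):
    """Sum the column scores of the bases observed in the window."""
    return sum(col[base] for col, base in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy aligned sites and target sequence (invented for illustration).
sites = ["TATA", "TATA", "TACA", "TATA"]
pssm = build_pssm(sites)
score, position = best_hit(pssm, "GGCTATAGG")
print(round(score, 2), position)
```

Collecting the start positions of such hits relative to a reference coordinate (for example a peak summit) across many sequences gives the kind of positional profile in which a positional preference could be looked for.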

12

Rohde, Christian [Verfasser]. "Development of experimental and bioinformatics methods for high resolution DNA methylation analysis of gene promoters on human chromosome 21 / Christian Rohde." Bremen : IRC-Library, Information Resource Center der Jacobs University Bremen, 2009. http://d-nb.info/1034996371/34.

13

Ayllón-Benítez, Aarón. "Development of new computational methods for a synthetic gene set annotation." Thesis, Bordeaux, 2019. http://www.theses.fr/2019BORD0305.

Abstract:

The revolution in new sequencing technologies has greatly increased the production of omics data and is leading to a new understanding of the relations between genotype and phenotype. To interpret and analyze data grouped according to a phenotype of interest, methods based on statistical enrichment have become a standard in biology. However, these methods synthesize the biological information by a priori selecting the over-represented terms, and they focus on the most studied genes, which may cover only a limited share of the annotated genes within a gene set. During this thesis, we explored different methods for annotating gene sets. Within this framework, we carried out three studies that allow the annotation of gene sets and thus improve the understanding of their biological context.

First, visualization approaches were applied to represent the annotation results provided by enrichment analysis for a gene set or a repertoire of gene sets. In this work, a visualization prototype called MOTVIS (MOdular Term VISualization) was developed to provide an interactive representation of a repertoire of gene sets by combining two visual metaphors: a treemap view that provides an overview and also displays detailed information about gene sets, and an indented tree view that can be used to focus on the annotation terms of interest. MOTVIS has the advantage of overcoming the limitations of each visual metaphor when used individually. This illustrates the value of combining different visual metaphors to facilitate the comprehension of biological results when complex data are represented.

Second, to address the issues of enrichment analysis, a new method was proposed for analyzing the impact of using different semantic similarity measures on gene set annotation. To evaluate the impact of each measure, two criteria were considered relevant for characterizing a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced while maintaining a sufficient level of detail, and (ii) the number of genes described by the selected terms should be as large as possible. Nine semantic similarity measures were analyzed to identify the best possible compromise between these two criteria. Using the Gene Ontology (GO) to annotate the gene sets, we observed better results with node-based measures, which use the characteristics of the terms, than with edge-based measures, which use the relations between terms. The annotations obtained with the node-based measures did not exhibit major differences regardless of the term characteristics used. We then developed GSAn (Gene Set Annotation), a novel gene set annotation web server that uses semantic similarity measures to synthesize a priori GO annotation terms. GSAn includes the interactive visualization MOTVIS, which jointly presents the representative terms and the genes of the set under study. Compared with enrichment analysis tools, GSAn showed excellent results in terms of maximizing gene coverage while minimizing the number of terms.

Finally, the third study consisted of enriching the annotation results provided by GSAn. Since the knowledge described in GO may not be sufficient for interpreting gene sets, other biological information, such as pathways and diseases, may be useful to provide a wider biological context. Two additional knowledge resources, Reactome and the Disease Ontology (DO), were therefore integrated into GSAn. In practice, GO terms were mapped to terms of Reactome and DO, both before and after applying the GSAn method. The integration of these resources improved the results in terms of gene coverage without significantly affecting the number of terms involved. Two strategies were applied to find mappings (generated or extracted from the web) between each new resource and GO. We have shown that applying the mapping process before running the GSAn method yielded a larger number of inter-relations between the two knowledge resources.
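
The two criteria named in this abstract (few annotation terms, many genes covered) can be made concrete with a small Python sketch. The greedy selection below is a generic set-cover heuristic chosen for illustration only; it is not the semantic-similarity method used by GSAn, and the term names, gene names and `max_terms` parameter are hypothetical.

```python
def gene_coverage(selected_terms, term_to_genes, gene_set):
    """Fraction of genes in gene_set annotated by at least one selected term."""
    covered = set()
    for term in selected_terms:
        covered |= term_to_genes[term] & gene_set
    return len(covered) / len(gene_set)


def greedy_terms(term_to_genes, gene_set, max_terms):
    """Repeatedly pick the term that covers the most still-uncovered genes."""
    remaining = set(gene_set)
    chosen = []
    while remaining and len(chosen) < max_terms:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(term_to_genes), key=lambda t: len(term_to_genes[t] & remaining))
        gain = term_to_genes[best] & remaining
        if not gain:
            break  # no term adds coverage; stop early
        chosen.append(best)
        remaining -= gain
    return chosen


# Hypothetical annotation table for a five-gene set.
gene_set = {"g1", "g2", "g3", "g4", "g5"}
term_to_genes = {
    "term_A": {"g1", "g2", "g3"},
    "term_B": {"g3", "g4"},
    "term_C": {"g4", "g5"},
    "term_D": {"g1"},
}
chosen = greedy_terms(term_to_genes, gene_set, max_terms=3)
coverage = gene_coverage(chosen, term_to_genes, gene_set)
print(chosen, coverage)
```

Here two of the four terms already describe every gene, which is the kind of trade-off between term count and gene coverage that the abstract evaluates.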

14

Manser, Paul. "Methods for Integrative Analysis of Genomic Data." VCU Scholars Compass, 2014. http://scholarscompass.vcu.edu/etd/3638.

Abstract:

In recent years, the development of new genomic technologies has allowed for the investigation of many regulatory epigenetic marks besides expression levels, on a genome-wide scale. As the price for these technologies continues to decrease, study sizes will not only increase, but several different assays are beginning to be used for the same samples. It is therefore desirable to develop statistical methods to integrate multiple data types that can handle the increased computational burden of incorporating large data sets. Furthermore, it is important to develop sound quality control and normalization methods as technical errors can compound when integrating multiple genomic assays. DNA methylation is a commonly studied epigenetic mark, and the Infinium HumanMethylation450 BeadChip has become a popular microarray that provides genome-wide coverage and is affordable enough to scale to larger study sizes. It employs a complex array design that has complicated efforts to develop normalization methods. We propose a novel normalization method that uses a set of stable methylation sites from housekeeping genes as empirical controls to fit a local regression hypersurface to signal intensities. We demonstrate that our method performs favorably compared to other popular methods for the array. We also discuss an approach to estimating cell-type admixtures, which is a frequent biological confound in these studies. For data integration we propose a gene-centric procedure that uses canonical correlation and subsequent permutation testing to examine correlation or other measures of association and co-localization of epigenetic marks on the genome. Specifically, a likelihood ratio test for general association between data modalities is performed after an initial dimension reduction step. Canonical scores are then regressed against covariates of interest using linear mixed effects models. 
Lastly, permutation testing is performed on weighted correlation matrices to test for co-localization of relationships to physical locations in the genome. We demonstrate these methods on a set of developmental brain samples from the BrainSpan consortium and find substantial relationships between DNA methylation, gene expression, and alternative promoter usage primarily in genes related to axon guidance. We perform a second integrative analysis on another set of brain samples from the Stanley Medical Research Institute.
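
The permutation-testing idea mentioned in this abstract can be sketched in a few lines of Python. This is a simplified stand-in: it tests a plain Pearson correlation between two vectors rather than canonical correlations or weighted correlation matrices, and the data values, the number of permutations and the seed are invented for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: how often a shuffled y gives |r| >= the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any real pairing between x and y
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Invented values: methylation level vs. expression for eight samples of one gene.
methylation = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90]
expression = [9.1, 8.4, 7.9, 7.0, 6.2, 5.8, 4.9, 4.1]
p_value = permutation_pvalue(methylation, expression)
print(p_value)
```

Shuffling one variable destroys any real association while preserving both marginal distributions, so the proportion of shuffles that match or exceed the observed statistic estimates how surprising that statistic is under no association.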


15

Gerst, Michelle Marie. "Improving methods to isolate bacteria producing antibacterial compounds followed by identification and characterization of select antimicrobials." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1512070391589857.


16

Stephens, Alex J. "The development of rapid genotyping methods for methicillin-resistant Staphylococcus aureus." Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/20172/1/Alexander_Stephens_Thesis.pdf.


Abstract:

Methicillin-resistant Staphylococcus aureus (MRSA) is an important human pathogen that is endemic in hospitals all over the world. It has more recently emerged as a serious threat to the general public in the form of community-acquired MRSA. MRSA has been implicated in a wide variety of diseases, ranging from skin infections and food poisoning to more severe and potentially fatal conditions, including endocarditis, septicaemia and necrotising pneumonia. Treatment of MRSA disease is complicated and can be unsuccessful due to the bacterium's remarkable ability to develop antibiotic resistance. The considerable economic and public health burden imposed by MRSA has fuelled attempts by researchers to understand the evolution of virulent and antibiotic-resistant strains and thereby improve epidemiological management strategies. Central to MRSA transmission management strategies is the implementation of active surveillance programs, via which unique genetic fingerprints, or genotypes, of each strain can be identified. Despite numerous advances in MRSA genotyping methodology, there remains a need for a rapid, reproducible, cost-effective method that is capable of producing a high level of genotype discrimination, whilst being suitable for high-throughput use. Consequently, the fundamental aim of this thesis was to develop a novel MRSA genotyping strategy incorporating these benefits. This thesis explored the possibility that the development of more efficient genotyping strategies could be achieved through careful identification, and then simple interrogation, of multiple, unlinked DNA loci that exhibit progressively increasing mutation rates. The baseline component of the MRSA genotyping strategy described in this thesis is the allele-specific real-time PCR interrogation of slowly evolving core single nucleotide polymorphisms (SNPs). 
The genotyping SNP set was identified previously from the Multi-locus sequence typing (MLST) sequence database using an in-house software package named Minimum SNPs. As discussed in Chapter Three, the genotyping utility of the SNP set was validated on 107 diverse Australian MRSA isolates, which were largely clustered into groups of related strains as defined by MLST. To increase the resolution of the SNP genotyping method, a selection of binary virulence genes and antimicrobial resistance plasmids was tested and proved successful at subtyping the SNP groups. A comprehensive MRSA genotyping strategy requires characterisation of the clonal background as well as interrogation of the hypervariable Staphylococcal Cassette Chromosome mec (SCCmec) that carries the β-lactam resistance gene, mecA. SCCmec genotyping defines the MRSA lineages; however, current SCCmec genotyping methods have struggled to handle the increasing number of SCCmec elements resulting from a recent explosion of comparative genomic analyses. Chapter Four of this thesis collates the known SCCmec binary marker diversity and demonstrates the ability of Minimum SNPs to identify systematically a minimal set of binary markers capable of generating maximum genotyping resolution. A number of binary targets were identified that indeed permit high-resolution genotyping of the SCCmec element. Furthermore, the SCCmec genotyping targets are amenable to combinatorial use with the MLST genotyping SNPs and therefore are suitable as the second component of the MRSA genotyping strategy. To increase genotyping resolution of the slowly evolving MLST SNPs and the SCCmec binary markers, the analysis of a hypervariable repeat region was required. Sequence analysis of the Staphylococcal protein A (spa) repeat region has been conducted frequently with great success. Chapter Five describes the characterisation of the tandem repeats in the spa gene using real-time PCR and high resolution melting (HRM) analysis. 
Since the melting rate and precise point of dissociation of double-stranded DNA are dependent on the size and sequence of the PCR amplicon, the HRM method was used successfully to identify 20 of 22 spa sequence types, without the need for DNA sequencing. The accumulation of comparative genomic information has allowed the systematic identification of key MRSA genomic polymorphisms to genotype MRSA efficiently. If implemented in its entirety, the strategy described in this thesis would produce efficient and deep-rooted genotypes. For example, an unknown MRSA isolate would be positioned within the MLST-defined population structure, categorised based on its SCCmec lineage, then subtyped based on the polymorphic spa repeat region. Overall, by combining the genotyping methods described here, an integrated and novel MRSA genotyping strategy results that is efficacious for both long- and short-term investigations. Furthermore, an additional benefit is that each component can be performed easily and cost-effectively on a standard real-time PCR platform.
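The idea of finding a minimal marker set with maximum genotyping resolution can be sketched as a greedy search. This is an illustrative assumption, not the algorithm implemented in Minimum SNPs, whose internals the abstract does not describe. The type labels and allele strings are hypothetical toy data.

```python
def count_groups(profiles, positions):
    """Number of distinct genotypes seen when only `positions` are typed."""
    return len({tuple(p[i] for i in positions) for p in profiles.values()})


def greedy_markers(profiles, max_markers=None):
    """Greedily pick marker positions until all distinct profiles are resolved.

    At each step the position that yields the most distinguishable groups,
    given the positions already chosen, is added (ties go to the lowest index).
    """
    n_positions = len(next(iter(profiles.values())))
    target = len(set(profiles.values()))
    chosen = []
    while count_groups(profiles, chosen) < target:
        if max_markers is not None and len(chosen) >= max_markers:
            break
        best = max(
            (i for i in range(n_positions) if i not in chosen),
            key=lambda i: count_groups(profiles, chosen + [i]),
        )
        chosen.append(best)
    return chosen


# Hypothetical allele profiles: one character per candidate marker position.
profiles = {
    "type_A": "AGCTA",
    "type_B": "AGTTA",
    "type_C": "GGCTA",
    "type_D": "GGTTC",
}

markers = greedy_markers(profiles)
print(markers, count_groups(profiles, markers))
```

In this toy data, positions 0 and 2 together separate all four types, so the remaining positions add no resolution. A greedy search is not guaranteed to find the smallest possible set in general.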


17

Stephens, Alex J. "The development of rapid genotyping methods for methicillin-resistant Staphylococcus aureus." Queensland University of Technology, 2008. http://eprints.qut.edu.au/20172/.


18

Cui, Lingfei. "A Likelihood Method to Estimate/Detect Gene Flow and A Distance Method to Estimate Species Trees in the Presence of Gene Flow." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1406158261.


19

Chavan, Archana G. "Exploring the molecular architecture of proteins: Method developments in structure prediction and design." Thesis, University of the Pacific, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3609082.


Abstract:

Proteins are molecular machines of life in the truest sense. Being the expressors of genotype, proteins have been a focus in structural biology. Since the first characterization and structure determination of a protein molecule more than half a century ago, our understanding of protein structure has improved only incrementally. While computational analysis and experimental techniques have helped scientists view the structural features of proteins, our concepts about protein folding remain at the level of simple hydrophobic interactions packing side chains at the core of the protein. Furthermore, because the rate of genome sequencing is far more rapid than protein structure characterization, much more needs to be achieved in the field of structural biology. As a step in this direction, my dissertation research uses computational analysis and experimental techniques to elucidate the fine structural features of the tertiary packing in proteins. With this set of studies, the knowledge of the field of structural biology extends to the fine details of higher-order protein structure.


20

Jiménez, Sánchez Alejandro. "Characterisation of the tumour microenvironment in ovarian cancer." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/287935.


Abstract:

The tumour microenvironment comprises the non-cancerous cells present in the tumour mass (fibroblasts, endothelial, and immune cells), as well as signalling molecules and extracellular matrix. Tumour growth, invasion, metastasis, and response to therapy are influenced by the tumour microenvironment. Therefore, characterising the cellular and molecular components of the tumour microenvironment, and understanding how they influence tumour progression, represent a crucial aim for the success of cancer therapies. High-grade serous ovarian cancer provides an excellent opportunity to systematically study the tumour microenvironment due to its clinical presentation of advanced disseminated disease and debulking surgery being standard of care. This thesis first presents a case report of a long-term survivor (>10 years) of metastatic high-grade serous ovarian cancer who exhibited concomitant regression/progression of the metastatic lesions (5 samples). We found that progressing metastases were characterized by immune cell exclusion, whereas regressing metastases were infiltrated by CD8+ and CD4+ T cells. Through a T cell - neoepitope challenge assay we demonstrated that predicted neoepitopes were recognised by the CD8+ T cells obtained from blood drawn from the patient, suggesting that regressing tumours were subjected to immune attack. Immune excluded tumours presented a higher expression of immunosuppressive Wnt signalling, while infiltrated tumours showed a higher expression of the T cell chemoattractant CXCL9 and evidence of immunoediting. These findings suggest that multiple distinct tumour immune microenvironments can co-exist within a single individual and may explain in part the heterogeneous fates of metastatic lesions often observed in the clinic post-therapy. 
Second, this thesis explores the prevalence of intra-patient tumour microenvironment heterogeneity in high-grade serous ovarian cancer at diagnosis (38 samples from 8 patients), as well as the effect of chemotherapy on the tumour microenvironment (80 paired samples from 40 patients). Whole transcriptome analysis and image-based quantification of T cells from treatment-naive tumours revealed highly prevalent variability in immune signalling and distinct immune microenvironments co-existing within the same individuals at diagnosis. ConsensusTME, a method that generates consensus immune and stromal cell gene signatures by intersecting state-of-the-art deconvolution methods that predict immune cell populations using bulk RNA data, was developed. ConsensusTME improved accuracy and sensitivity of T cell and leukocyte deconvolutions in ovarian cancer samples. As previously observed in the case report, Wnt signalling expression positively correlated with immune cell exclusion. To evaluate the effect of chemotherapy on the tumour microenvironment, we compared site-matched and site-unmatched tumours before and after neoadjuvant chemotherapy. Site-matched samples showed increased cytotoxic immune activation and oligoclonal expansion of T cells after chemotherapy, unlike site-unmatched samples where heterogeneity could not be accounted for. In addition, low levels of immune activation pre-chemotherapy were found to be correlated with immune activation upon chemotherapy treatment. These results corroborate that the tumour-immune interface in advanced high-grade serous ovarian cancer is intrinsically heterogeneous, and that chemotherapy induces an immunogenic effect mediated by cytotoxic cells. Finally, the different deconvolution methods were benchmarked along with ConsensusTME in a pan-cancer setting by comparing deconvolution scores to DNA-based purity scores, leukocyte methylation data, and tumour infiltrating lymphocyte counts from image analysis. 
In so far as it has been benchmarked, unlike the other methods, ConsensusTME performs consistently among the top three methods across cancer-related benchmarks. Additionally, ConsensusTME provides a dynamic and evolvable framework that can integrate newer deconvolution tools and benchmark their performance against itself, thus generating an ever-updated version. Overall, this thesis presents a systematic characterisation of the tumour microenvironment of high-grade serous ovarian cancer in treatment-naive and chemotherapy-treated samples, and puts forward the development of an integrative computational method for the systematic analysis of the tumour microenvironment of different tumour types using bulk RNA data.
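The notion of building a consensus gene signature from several deconvolution methods can be sketched as follows. This is a toy illustration under stated assumptions, not ConsensusTME's implementation: the method names, gene lists, expression values, the `min_methods` threshold, and the mean-expression score are all hypothetical choices made for the example.

```python
from statistics import mean


def consensus_signature(signatures, min_methods=2):
    """Genes that appear in the gene lists of at least `min_methods` methods.

    Setting `min_methods` to the number of methods gives a strict intersection.
    """
    counts = {}
    for genes in signatures.values():
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(g for g, c in counts.items() if c >= min_methods)


def signature_score(expression, genes):
    """Mean expression over the signature genes measured in this sample."""
    values = [expression[g] for g in genes if g in expression]
    return mean(values) if values else float("nan")


# Hypothetical per-method signatures for one cell type (assumed gene lists).
signatures = {
    "method_A": ["CD8A", "CD8B", "GZMB"],
    "method_B": ["CD8A", "CD3E", "CD8B"],
    "method_C": ["CD8A", "PRF1", "CD3E"],
}

# Hypothetical log-scale expression values for one bulk RNA sample.
sample_expression = {"CD8A": 6.0, "CD8B": 4.0, "CD3E": 5.0, "GZMB": 1.0}

consensus = consensus_signature(signatures)
score = signature_score(sample_expression, consensus)
print(consensus, score)
```

Requiring support from more than one method drops genes that only a single source proposes, which is one simple way to reduce method-specific noise in a signature.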


21

Rivas, Cruz Manuel A. "Medical relevance and functional consequences of protein truncating variants." Thesis, University of Oxford, 2015. http://ora.ox.ac.uk/objects/uuid:a042ca18-7b35-4a62-aef0-e3ba2e8795f7.


Abstract:

Genome-wide association studies have greatly improved our understanding of the contribution of common variants to the genetic architecture of complex traits. However, two major limitations have been highlighted. First, common variant associations typically do not identify the causal variant and/or the gene on which it exerts its effect to influence a trait. Second, common variant associations usually consist of variants with small effects. As a consequence, it is more challenging to harness their translational impact. Association studies of rare variants and complex traits may be able to help address these limitations. Empirical population genetic data show that deleterious variants are rare. More specifically, there is a very strong depletion of common protein truncating variants (PTVs, commonly referred to as loss-of-function variants) in the genome, a group of variants that have been shown to have a large effect on gene function, are enriched for severe disease-causing mutations, but in other instances may actually be protective against disease. This thesis is divided into three parts dedicated to the study of protein truncating variants, their medical relevance, and their functional consequences. First, I present statistical, bioinformatic, and computational methods developed for the study of protein truncating variants, their association with complex traits, and their functional consequences. Second, I present application of the methods to a number of case-control and quantitative trait studies, discovering new variants and genes associated with breast and ovarian cancer, type 1 diabetes, lipids, and metabolic traits measured with NMR spectroscopy. Third, I present work on improving annotation of protein truncating variants by studying their functional consequences. Taken together, these results highlight the utility of interrogating protein truncating variants in medical and functional genomic studies.


22

Sinclair, Lucas. "Molecular methods for microbial ecology: Developments, applications and results." Doctoral thesis, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-297613.


Abstract:

Recent developments in DNA sequencing technology allow the study of microbial ecology at unmatched detail. To fully embrace this revolution, an important avenue of research is the development of bioinformatic tools that enable scientists to leverage and manipulate the exceedingly large amounts of data produced. In this thesis, several bioinformatic tools were developed in order to process and analyze metagenomic sequence data. Subsequently, the tools were applied to the study of microbial biogeography and microbial systems biology. A targeted metagenomics pipeline automating quality filtering, joining and taxonomic annotation was developed to assess the diversity of bacteria, archaea and eukaryotes, permitting the study of biogeographic patterns in great detail. Next, a second software package, which provides annotation based on environmental ontology terms, was coded with the aim of exploiting the cornucopia of information available in public databases. It was applied to resource tracking, paleontology, and biogeography. Indeed, both these tools have already found broad applications in extending our understanding of microbial diversity in inland waters and have contributed to the development of conceptual frameworks for microbial biogeography in lotic systems. The programs were used for analyzing samples from several environments such as alkaline soda lakes and ancient sediment cores. These studies corroborated the view that the dispersal limitations of microbes are more or less non-existent, with environmental properties dictating their distribution, and that dormant microbes allow the reconstruction of the origin and history of the sampled community. Furthermore, a shotgun metagenomics analysis pipeline for the characterization of total DNA extracted from the environment was put in place. The pipeline included all essential steps from raw sequence processing to functional annotation and reconstruction of prokaryotic genomes. 
By applying this tool, we were able to reconstruct the biochemical processes in a selection of systems representative of the tens of millions of lakes and ponds of the boreal landscape. This revealed the genomic content of abundant and so far undescribed prokaryotes harboring important functions in these ecosystems. We could show the presence of organisms with the capacity for photoferrotrophy and anaerobic methanotrophy encoded in their genomes, traits not previously detected in these systems. In another study, we showed that microbes respond to alkaline conditions by adjusting their energy acquisition and carbon fixation strategies. To conclude, we demonstrated that the "reverse ecology" approach in which the role of microbes in elemental cycles is assessed by genomic tools is very powerful as we can identify novel pathways and obtain the partitioning of metabolic processes in natural environments.


23

"Development of bioinformatics algorithms for trisomy 13 and 18 detection by next generation sequencing of maternal plasma DNA." 2011. http://library.cuhk.edu.hk/record=b5894869.


Abstract:

Chen, Zhang.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2011.
Includes bibliographical references (p. 109-114).
Abstracts in English and Chinese.
ABSTRACT
摘要 (Abstract in Chinese)
ACKNOWLEDGEMENTS
PUBLICATIONS
CONTRIBUTORS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
SECTION I: BACKGROUND
CHAPTER 1: PRENATAL DIAGNOSIS OF FETAL TRISOMY BY NEXT GENERATION SEQUENCING TECHNOLOGY
1.1 FETAL TRISOMY
1.2 CONVENTIONAL PRENATAL DIAGNOSIS OF FETAL TRISOMIES
1.3 CELL FREE FETAL DNA AND ITS APPLICATION IN PRENATAL DIAGNOSIS
1.4 NEXT GENERATION SEQUENCING TECHNOLOGY
1.5 SUBSTANTIAL BIAS IN THE NEXT GENERATION SEQUENCING PLATFORM
1.6 PRENATAL DIAGNOSIS OF TRISOMY BY NEXT GENERATION SEQUENCING
1.7 AIMS OF THIS THESIS
SECTION II: MATERIALS AND METHODS
CHAPTER 2: METHODS FOR NONINVASIVE PRENATAL DIAGNOSIS OF FETAL TRISOMY MATERNAL PLASMA DNA SEQUENCING
2.1 STUDY DESIGN AND PARTICIPANTS
2.1.1 Ethics Statement
2.1.2 Study design, setting and participants
2.2 MATERNAL PLASMA DNA SEQUENCING
2.3 SEQUENCING DATA ANALYSIS
SECTION III: TRISOMY 13 AND 18 DETECTION BY THE T21 BIOINFORMATICS ANALYSIS PIPELINE
CHAPTER 3: THE T21 BIOINFORMATICS ANALYSIS PIPELINE FOR TRISOMY 13 AND 18 DETECTION
3.1 INTRODUCTION
3.2 METHODS
3.2.1 Bioinformatics analysis pipeline for trisomy 13 and 18 detection
3.3 RESULTS
3.3.1 Performance of the T21 bioinformatics analysis pipeline for trisomy 13 and 18 detection
3.3.2 The precision of quantifying chr13 and chr18
3.4 DISCUSSION
SECTION IV: IMPROVING THE T21 BIOINFORMATICS ANALYSIS PIPELINE FOR TRISOMY 13 AND 18 DETECTION
CHAPTER 4: IMPROVING THE ALIGNMENT
4.1 INTRODUCTION
4.2 METHODS
4.2.1 Allowing mismatches in the index sequences
4.2.2 Calculating the mappability of the human reference genome
4.2.3 Aligning reads to the non-repeat masked human reference genome
4.2.4 Trisomy 13 and 18 detection
4.3 RESULTS
4.3.1 Increasing read numbers by allowing mismatches in the index sequences
4.3.2 Increasing read numbers by using the non-masked reference genome for alignment
4.3.3 Allowing mismatches in the read alignment
4.3.4 The performance of trisomy 13 and 18 detection after improving the alignment
4.4 DISCUSSION
CHAPTER 5: REDUCING THE GC BIAS BY CORRECTION OF READ COUNTS
5.1 INTRODUCTION
5.2 METHODS
5.2.1 Read alignment
5.2.2 Calculating the correlation between GC content and read counts
5.2.3 GC correction in read counts
5.2.4 Trisomy 13 and 18 detection
5.3 RESULTS
5.3.1 GC bias in plasma DNA sequencing
5.3.2 Correcting the GC bias in read counts by linear regression
5.3.3 Correcting the GC bias in read counts by LOESS regression
5.3.4 Bin size
5.4 DISCUSSION
CHAPTER 6: REDUCING THE GC BIAS BY MODIFYING THE GENOMIC REPRESENTATION CALCULATION
6.1 INTRODUCTION
6.2 METHODS
6.2.1 Modifying the genomic representation calculation
6.2.2 Trisomy 13 and 18 detection
6.2.3 Combining GC correction and modified genomic representation
6.3 RESULTS
6.3.1 Reducing the GC bias by modifying genomic representation calculation
6.3.2 Combining GC correction and modified genomic representation
6.4 DISCUSSION
CHAPTER 7: IMPROVING THE STATISTICS FOR TRISOMY 13 AND 18 DETECTION
7.1 INTRODUCTION
7.2 METHODS
7.2.1 Comparing chr13 or chr18 with other chromosomes within the sample
7.2.2 Comparing chr13 or chr18 with the artificial chromosomes
7.3 RESULTS
7.3.1 Determining the trisomy 13 and 18 status by comparing chromosomes within the samples
7.3.2 Determining the trisomy 13 and 18 status by comparing chr13 or chr18 with artificial chromosomes
7.4 DISCUSSION
SECTION V: CONCLUDING REMARKS
CHAPTER 8: CONCLUSION AND FUTURE PERSPECTIVES
8.1 THE PERFORMANCE OF THE T21 BIOINFORMATICS ANALYSIS PIPELINE DEVELOPED FOR TRISOMY 21 DETECTION IS SUBOPTIMAL FOR TRISOMY 13 AND 18 DETECTION
8.2 THE ALIGNMENT COULD BE IMPROVED BY ALLOWING ONE MISMATCH IN THE INDEX AND USING THE NON-REPEAT MASKED HUMAN REFERENCE GENOME AS THE ALIGNMENT REFERENCE
8.3 THE PRECISION OF QUANTIFYING CHR13 AND CHR18 COULD BE IMPROVED BY THE GC CORRECTION OR THE MODIFIED GENOMIC REPRESENTATION
8.4 THE STATISTICS FOR TRISOMY 13 AND 18 DETECTION COULD BE IMPROVED BY COMPARING CHR13 OR CHR18 WITH ARTIFICIAL CHROMOSOMES WITHIN THE SAMPLE
8.5 PROSPECTS FOR FUTURE WORK
REFERENCE
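
The outline above lists a correction of GC bias in read counts by linear regression. A minimal sketch of that general idea follows, assuming per-bin GC fractions and read counts as inputs. The function names, bin values, and the choice to re-centre residuals on the overall mean are illustrative assumptions, since the record gives only section titles and not the thesis's actual procedure.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def gc_correct(gc, counts):
    """Remove the linear GC trend from per-bin read counts.

    Each count is replaced by its residual from the fitted line, shifted back
    to the overall mean so that corrected values stay on the original scale.
    """
    slope, intercept = fit_line(gc, counts)
    overall = mean(counts)
    return [c - (slope * g + intercept) + overall for g, c in zip(gc, counts)]


# Hypothetical genomic bins: GC fraction and number of aligned reads per bin.
gc_fraction = [0.35, 0.40, 0.45, 0.50, 0.55]
read_counts = [900, 960, 1010, 1050, 1120]

corrected = gc_correct(gc_fraction, read_counts)
print([round(c, 1) for c in corrected])
```

After correction, refitting a line to the corrected counts gives a slope of approximately zero, which is a quick way to check that the linear GC trend has been removed.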


24

Rohde, Christian. "Development of experimental and bioinformatics methods for high resolution DNA methylation analysis of gene promoters on human chromosome 21 /." 2009. http://www.jacobs-university.de/phd/files/1254838231.pdf.


25

Reed, Eric R. "Development of advanced methods for large-scale transcriptomic profiling and application to screening of metabolism disrupting compounds." Thesis, 2020. https://hdl.handle.net/2144/41943.


Abstract:

High-throughput transcriptomic profiling has become a ubiquitous tool to assay an organism's transcriptome and to characterize gene expression patterns in different cellular states or disease conditions, as well as in response to molecular and pharmacologic perturbations. Refinements to data preparation techniques have enabled integration of transcriptomic profiling into large-scale biomedical studies, generally devised to elucidate phenotypic factors contributing to transcriptional differences across a cohort of interest. Understanding these factors and the mechanisms through which they contribute to disease is a principal objective of numerous projects, such as The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia. Additionally, transcriptomic profiling has been applied in toxicogenomic screening studies, which profile molecular responses to chemical perturbations in order to identify environmental toxicants and characterize their mechanisms of action. Further adoption of high-throughput transcriptomic profiling requires continued effort to improve and lower the costs of implementation. Accordingly, my dissertation work encompasses both the development and assessment of cost-effective RNA sequencing platforms, and of novel machine learning techniques applicable to the analyses of large-scale transcriptomic data sets. The utility of these techniques is evaluated through their application to a toxicogenomic screen in which our lab profiled exposures of adipocytes to metabolic disrupting chemicals. Such exposures have been implicated in metabolic dyshomeostasis, the predominant cause of obesity pathogenesis. Considering that an estimated 10% of the global population is obese, understanding the role these exposures play in disrupting metabolic balance has the potential to help combat this pervasive health threat. This dissertation consists of three sections. 
In the first section, I assess data generated by a highly-multiplexed RNA sequencing platform developed by our section, and report on its significantly better quality relative to similar platforms, and on its comparable quality to more expensive platforms. Next, I present the analysis of a toxicogenomic screen of metabolic disrupting compounds. This analysis crucially relied on novel supervised and unsupervised machine learning techniques which I specifically developed to take advantage of the experimental design we adopted for data generation. Lastly, I describe the further development, evaluation, and optimization of one of these methods, K2Taxonomer, into a computational tool for unsupervised molecular subgrouping of bulk and single-cell gene expression data, and for the comprehensive in-silico annotation of the discovered subgroups.

26

Venkatraman, Anand. "Validation of a novel expressed sequence tag (EST) clustering method and development of a phylogenetic annotation pipeline for livestock gene families." Thesis, 2008. http://hdl.handle.net/1969.1/ETD-TAMU-3112.

Abstract:

Prediction of functions of genes in a genome is a key step in all genome sequencing projects. Sequences that carry out important functions are likely to be conserved between evolutionarily distant species and can be identified using cross-species comparisons. In the absence of completed genomes and the accompanying high-quality annotations, expressed sequence tags (ESTs) from random cDNA clones are the primary tools for functional genomics. EST datasets are fragmented and redundant, necessitating clustering of ESTs into groups that are likely to have been derived from the same genes. EST clustering helps reduce the search space for sequence homology searching and improves the accuracy of function predictions using EST datasets. This dissertation is a case study that describes clustering of Bos taurus and Sus scrofa EST datasets, and utilizes the EST clusters to make computational function predictions using a comparative genomics approach. We used a novel EST clustering method, TAMUClust, to cluster bovine ESTs and compare its performance to the bovine EST clusters from TIGR Gene Indices (TGI) by using bovine ESTs aligned to the bovine genome assembly as a gold standard. This comparison study reveals that TAMUClust and TGI are similar in performance. Comparisons of TAMUClust and TGI with predicted bovine gene models reveal that both datasets are similar in transcript coverage. We describe here the design and implementation of an annotation pipeline for predicting functions of the Bos taurus (cattle) and Sus scrofa (pig) transcriptomes. EST datasets were clustered into gene families using Ensembl protein family clusters as a framework. Following clustering, the EST consensus sequences were assigned predicted function by transferring annotations of the Ensembl vertebrate protein(s) they are grouped to after sequence homology searches and phylogenetic analysis. 
The annotations benefit the livestock community by helping narrow down the gamut of direct experiments needed to verify function.

27

Mainz, Indra [Verfasser]. "Development and implementation of techniques for ontology engineering and an ontology-based search for bioinformatics tools and methods / vorgelegt von Indra Mainz." 2008. http://d-nb.info/99269776X/34.

28

Krishnadev, O. "Development And Applications Of Computational Methods To Aid Recognition Of Protein Functions And Interactions." Thesis, 2010. http://etd.iisc.ernet.in/handle/2005/1457.

Abstract:

Protein homology detection has played a central role in the understanding of evolution of protein structures, functions and interactions. Many of the developments in protein bioinformatics can be traced back to an initial step of homology detection. It is not surprising then, that extension of remote homology detection has gained a lot of attention in the recent past. The explosive growth of genome sequences and the slow pace of experimental techniques have thrust computational analyses into the limelight. It is not surprising to see that many of the traditional experimental areas such as gene expression analysis, recognition of function and recognition of 3-D structure have been attempted effectively by computational approaches. The idea behind homology-based bioinformatics work is the fact that the hereditary mechanisms ensure that the parent generation gives rise to a very similar offspring generation. Since the biological functions of the proteins of an organism are a product of the expression of its genetic material, it follows that the genes of an organism should show conservation from one generation to another (with very few mutations if parent and offspring generations are to be nearly identical). Thus, if it can be established that two proteins have descended from a common ancestor, then it can be inferred that the biological functions of the two proteins could be very similar. Thus, homology-based information transfer from one protein to another has become a commonly used procedure in protein bioinformatics. The ability to recognize homologs of a protein solely from amino acid sequences has seen a steady increase in the last two decades. However, there are currently still a large number of proteins of known amino acid sequence and yet unknown function. Thus, a major goal of current computational work is to extend the limits of remote homology detection to enable the functional characterization of proteins of unknown function. 
Since proteins do not work in isolation in a cell, it has become essential to understand the in vivo context of the function of a protein. For this purpose, it is essential to have an understanding of all the molecules that interact with a particular protein. Thus, another major area of bioinformatics has been to integrate biological information with protein-protein interactions to enable a better understanding of the molecular processes. Such attempts have been made successfully for the interaction network of proteins within an organism. The extension of the interaction network analysis to a host-pathogen scenario can lead to useful insights into pathophysiology of diseases. The work done as part of the thesis explores both the ideas mentioned above, namely, the extension of limits of remote homology detection and prediction of protein-protein interactions between a pathogen and its host. Since the work can logically be divided into two different areas, though there is a connection between them, the thesis is organized as two parts. The first part of the thesis (comprising Chapters 2, 3, 4 and 5) describes the development and application of remote homology detection tools for function/structure annotation. The second part of the thesis (comprising Chapters 6, 7, 8 and 9) describes the development and application of a homology-based procedure for detection of host-pathogen protein-protein interactions. Chapter 1 provides a background and literature survey in the areas of homology detection and prediction of protein-protein interactions. It is argued that homology-based information transfer is currently an important tool in the prediction and recognition of protein structures, functions and interactions. The development of remote homology detection methods and its effect on function recognition has been highlighted. 
Recent work in the area of prediction of protein-protein interactions using homology to known interaction templates is described and it is implied to be a successful approach for prediction of protein-protein interactions on a genome scale. The importance of further improvements in remote homology detection (as done in the first part of the thesis) is emphasized for annotation of proteins in newly sequenced genomes. The importance of application of homology detection methods in predicting protein-protein interactions across host-pathogen organisms is also explored. Chapter 2 analyzes the performance of PSI-BLAST, one of the well-known and very effective approaches for recognition of related proteins, for remote homology detection. The chapter describes in detail the working of the PSI-BLAST algorithm and focuses on three parameters that determine the time required for searching in a large database, and also provide a ceiling for the sensitivity of the search procedure. The parameters that have been analyzed are the window size for the two-hit method, the threshold for extension of an initial hit to dynamic programming and the extent of dependence on the query as encompassed in the profile generation step. The procedure followed for the analysis is to consider a large database of known evolutionary relationships (the SCOP database was chosen for the analysis), and use the PSI-BLAST program at different values of the three parameters to find out the effect on sensitivity (defined as the normalized number of correct SCOP superfamily relationships found in a search), and the time required for completion of the search. For the demonstration of the effect on the query dependence, a multiple sequence alignment (MSA) of a SCOP family (generated from all family sequences using ClustalW), was used with multiple queries to derive profiles in PSI-BLAST runs. The increase in sensitivity and the increase in time required for completion of each search were then monitored. 
Changing the two PSI-BLAST internal parameters, the score threshold for extension of word hits and the window size for the two-hit method, does not result in a significant increase in sensitivity. Since PSI-BLAST uses the amino acid residues present in the query sequence to derive the Position Specific Scoring Matrix (PSSM) parameters, there is a strong query dependence on the sensitivity of each PSSM. Using multiple PSSMs derived from a single MSA can thus help overcome the query dependence and increase the sensitivity. In this Chapter such an approach, named MulPSSM, has been demonstrated to have higher sensitivity than the single-profile approach (up to two times higher) in a benchmark dataset of 100 randomly chosen SCOP folds. Strategies to optimize sensitivity and the time required in searching MulPSSM have been explored and it is found that use of a non-redundant set of queries to generate MulPSSM can reduce the time required for each search while not affecting the sensitivity by a large degree. The application of the MulPSSM approach in function annotation of proteins in completely sequenced genomes was explored by searching genomic sequences in a MulPSSM database of Pfam families. The association of function to proteins has been assessed when both the single-profile-per-family database and the MulPSSM database of families were used. It is found that in a comprehensive list of 291 genomes of Prokaryotes, 44 genomes of Eukaryotes and 40 genomes of Archaea, on average MulPSSM is able to identify evolutionary relationships for 10% more proteins in a genome than the single-profile-based approach. Such an enhancement in the recognition of evolutionary relationships, which has an implication in obtaining clues to functions, can help in more efficient exploration of newly sequenced genomes. Identification of evolutionary relationships involving some of the proteins of M. tuberculosis and M. 
leprae has been possible due to the use of the multiple-profile search approach, which is discussed in this chapter. The examples of annotations provided in the chapter include enzymes that are involved in glycolipid synthesis, which are vital for the survival of the pathogens inside the host, and such annotations can help in expanding our knowledge of these processes. Chapter 3 describes the development and assessment of a sensitive remote homology detection method. The sensitivity of remote homology detection methods has been steadily increasing in the past decade and profile analysis has become a mainstay of such efforts. The profile is a probabilistic model of substitutions allowed at each position in a sequence family, and hence captures the essential features of a family. Alignment of two such profiles is thus considered to provide a more sensitive and accurate method than the alignment of two sequences. The performance of HMMs (Hidden Markov Models) has been shown to be higher than that of PSSMs (Position Specific Scoring Matrices). Thus, a profile-profile alignment using HMMs can in principle give the best possible sensitivity in remote homology detection. Many investigators have incorporated residue conservation and secondary structure information to align two HMMs, and such additional information has been demonstrated to provide better sensitivity in remote homology detection (for instance in the HHSearch program). The work presented in Chapter 3 extends the idea of incorporating additional information such as explicit hydrophobicity information, along with conservation and predicted secondary structure over a window of Multiple Sequence Alignment (MSA) columns in aligning HMMs. The new algorithm is named AlignHUSH (Alignment of HMMs Using Secondary structure and Hydrophobicity). 
The HMMs used in the work are derived from structural alignments using the HMMER program and are taken from the publicly available superfamily database which provides HMMs for all the SCOP families. The HMMs are modified into two-state HMMs by collapsing the ‘insert’ and ‘delete’ states into a ‘non-match’ state in the AlignHUSH algorithm. The two-state HMMs enable the use of dynamic programming methods and keep intact the position-specific gap penalties. The two-state HMMs can be more readily extended to alignment of PSSMs. The incorporation of secondary structure information is made using secondary structure predictions made using the PSIPRED program. The hydrophobicity information is calculated using the Kyte-Doolittle hydrophobicity values. The alignment is generated by scoring each position using the values present in a window of residues. The assessment of alignment accuracy is done by comparison to manually curated alignments present in the BaliBASE database. A detailed description of the optimization steps followed for obtaining the values for each score contribution (conservation, secondary structure and hydrophobicity) is provided. The assessment revealed that a high weightage to conservation score (18.0) and low weightage to the secondary structure score (1.5) and hydrophobicity (1.0) is optimal. The use of residue windows in alignment has been shown to dramatically increase the sensitivity (around 30% on a small dataset comprising 10% of total SCOP domains). The sensitivity of the AlignHUSH algorithm in comparison to other HMM-HMM alignment methods HHSearch and PRC in an all-against-all comparison of SCOP 1.69 database demonstrates that AlignHUSH has better sensitivity than both HHSearch and PRC (approximately by 10% and 5% respectively). 
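The weighted, windowed scoring described above can be sketched roughly as follows. Only the three weights (18.0, 1.5, 1.0) come from the abstract; the function names, the form of the per-column similarity terms, and the averaging over a diagonal window are illustrative assumptions and not the AlignHUSH implementation.

```python
# Weights reported in the abstract; everything else is an illustrative assumption.
W_CONSERVATION = 18.0
W_SECONDARY = 1.5
W_HYDROPHOBICITY = 1.0


def column_score(conservation, secondary, hydrophobicity):
    """Weighted sum of three per-column similarity terms (hypothetical inputs)."""
    return (W_CONSERVATION * conservation
            + W_SECONDARY * secondary
            + W_HYDROPHOBICITY * hydrophobicity)


def window_score(pair_scores, i, j, half_window=2):
    """Average pair_scores[i+k][j+k] over offsets k that stay inside the matrix.

    pair_scores[a][b] is assumed to hold column_score for column a of one
    profile against column b of the other.
    """
    total, count = 0.0, 0
    for k in range(-half_window, half_window + 1):
        a, b = i + k, j + k
        if 0 <= a < len(pair_scores) and 0 <= b < len(pair_scores[a]):
            total += pair_scores[a][b]
            count += 1
    return total / count
```

A windowed score of this kind would then take the place of the single-column score in a standard dynamic programming alignment.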
The alignment accuracy calculated as the ratio of correctly aligned residues and all alignment positions in BaliBASE alignments reveals that AlignHUSH algorithm provides an accuracy comparable or marginally higher than both HHSearch and PRC (25% for AlignHUSH and roughly 17% for both HHSearch and PRC). A few examples of structural relationships between SCOP families belonging to different folds and/or classes are presented in the chapter to illustrate the strength of AlignHUSH in detecting very remote relationships. Chapter 4 describes a database of evolutionary relationships identified between Pfam families. The grouping of Pfam families is important for obtaining better understanding on evolutionary relationships and in obtaining clues to functions of proteins in families of yet unknown function. Much effort has been taken by various investigators in bringing many proteins in the sequence databases within homology modeling distance with a protein of known structure. Structural genomics initiatives spend considerable effort in achieving this goal. The results from such experiments suggest that in many cases after the structure has been solved using X-ray crystallography or NMR methods, the protein is seen to have structural similarity to a protein of already known structure. Thus, an inability to detect such remote relationships severely impairs the efficiency of structural genomics initiatives. The development of the SUPFAM method was made earlier in the group to enable detection of distant relationships between Pfam families. In SUPFAM approach, relationships are detected by mapping the Pfam families to SCOP families. Further, using the implicit or explicit evolutionary relationship information present in the SCOP database relationships between Pfam families are detected. The work presented in this chapter is an improvement of previous development using the significantly more sensitive AlignHUSH method to uncover more relationships. 
The new database follows a procedure slightly different than the older SUPFAM database and hence is called SUPFAM+. The relative improvement brought by SUPFAM+ has been discussed in detail in the chapter. The methodology followed for the analysis is to first generate SUPFAM database by recognition of relationships between Pfam families and SCOP families using PSI BLAST / RPS BLAST. For the generation of SUPFAM+ database, recognition of relationships between Pfam families and SCOP families is done using AlignHUSH. The criteria are kept stringent at this stage to minimize the rate of false positives. In cases of a Pfam family mapping to two or more SCOP superfamilies, a semi-automated decision tree is used to assign the Pfam family to a single SCOP superfamily. Some of the Pfam families which remain without a mapping to a SCOP family are mapped indirectly to a SCOP family by identifying relationships between such Pfam families and other Pfam families which are already mapped to a SCOP family. In the final step, the Pfam families still without a SCOP family mapping are mapped onto one another to form ‘Potential New Superfamilies’ (PNSF), which are excellent targets for structural genomics since none of the proteins in such PNSFs have a recognizable homologue of known structure. The clustering of Pfam families into Superfamilies belonging to SCOP 1.69 version, were then queried to check if a structure has been solved for these Pfam families subsequent to the release of the SCOP 1.69 database. The latest SCOP database reveals that for close to 87 Pfam families a structure was solved which is at best related at a SCOP superfamily level with a family present in SCOP 1.69. An analysis of the mappings provided by SUPFAM+ database reveals that the mappings are correct in 85% of the cases at the SCOP superfamily level. An in-depth analysis revealed that among the rest of the cases, only one can be adjudged as an incorrect mapping. 
Many of the inconsistent mappings were found to be due to the absence of the SCOP fold in the SCOP 1.69 release, although interestingly the mapping provided by SUPFAM+ database shows structural similarity to the actual fold for the Pfam family found subsequently. A straightforward comparison with a similar database (Pfam Clans database) reveals that the SUPFAM+ database could suggest four times more pairwise relationships between Pfam families than the Pfam Clans database. Thus, since the structural mappings provided in the SUPFAM+ database are very accurate, the relationships found in the database could help in function annotation of uncharacterized protein families (explored in Chapter 5). The accuracy of mapping would be similar for the PNSFs, and hence these clusters can be excellent targets for structural genomics initiatives. The classification of families based on sequence/structural similarities can also be useful for function annotation of families of uncharacterized proteins, and such an idea is explored in the next chapter. Chapter 5 describes the attempts made to obtain clues to the structure and/or function of the DUF (Domain of Unknown Function) families present in the Pfam database. Currently, the DUF families populate around 21% of the Pfam database (2260 out of 10340). Thus, although homologues for each of the proteins in these families can be recognized in sequence databases, the homology does not provide obvious insight into the function of these proteins. The annotation of such difficult targets is a major goal of computational biologists in the post-genomic era. The development of a sensitive profile-profile alignment method as part of this thesis gives an excellent opportunity to increase the number of annotations for proteins, especially in the DUF families, since a profile for these families exists in the Pfam database. 
The method followed for the analysis is similar to the SUPFAM+ development, and involved generation of Pfam profiles compatible with the AlignHUSH method. For the analysis presented in the chapter, relationships found between DUF families and SCOP families were analyzed. In benchmarks using the AlignHUSH method, it was found that a Z score of 5.0 gives a 10% error rate, and a Z score of 7.5 gives an error rate of 1%, and hence a minimum Z score cutoff of 7.5 was used in the analysis. A very high Z score in AlignHUSH is usually seen in cases where sequence identity is also high, so a maximum Z score cutoff of 12.0 was used to find DUF families which are difficult to annotate using other profile based methods (such as PSI-BLAST). For some of the DUF families, subsequent structure determination of one of the proteins had been reported in literature, and these cases were used to assess the accuracy of structural annotation using AlignHUSH. In other cases, fold recognition was done using the PHYRE method to ensure that the structure mappings are corroborated by fold recognition. In all cases studied, the alignment of the DUF family with the SCOP family was generated and queried for conservation of active site residues reported for each homologous SCOP family in the CSA (Catalytic Site Atlas) database. The assessment on 8 DUF families for which a structure was solved subsequent to the SCOP release used in the analysis reveals that in all cases, the correct structure was identified using the AlignHUSH procedure. In the eight cases of validated structure annotation, the conservation of active site residues was seen, pointing to the effectiveness of AlignHUSH and its use in function annotation. In the 27 cases in which a structure for any one of the proteins in the DUF family is not known, the results from fold recognition corroborate the suggestion made by AlignHUSH in all cases. 
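The Z score window described above (a minimum of 7.5 and a maximum of 12.0) amounts to a simple band-pass filter over candidate hits. The sketch below is only an illustration: the tuple layout, the family identifiers in the usage note, and the choice of inclusive bounds are assumptions, not details given in the abstract.

```python
Z_MIN = 7.5   # abstract: roughly a 1% error rate at this cutoff
Z_MAX = 12.0  # abstract: higher scores usually coincide with high sequence identity


def select_hits(hits, z_min=Z_MIN, z_max=Z_MAX):
    """Keep (duf_family, scop_family, z_score) tuples with z_min <= z_score <= z_max.

    Whether the bounds are inclusive is an assumption made for this sketch.
    """
    return [hit for hit in hits if z_min <= hit[2] <= z_max]
```

For instance, with made-up identifiers, `select_hits([("DUF_A", "fam_1", 6.0), ("DUF_B", "fam_2", 9.1)])` keeps only the second tuple.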
The alignments of each of the DUF families with the suggested homologous SCOP family reveals that in many cases the active site residues are not conserved or are substituted by different residues. An in-depth analysis of some cases reveals that the non-conservation of residues occurs between two SCOP families in the same SCOP superfamily. Thus, although structure annotation can be reliably provided for all the DUF families studied, the exact biochemical function could be detected only for those cases in which active site conservation is seen even among distantly related families (such as two SCOP families in the same SCOP superfamily). The development and application of methods for remote homology detection has been made successfully and it has been demonstrated in the first part of the thesis that there is scope for extending the limits of remote homology detection. The use of sequence derived information in aligning profiles makes the procedure generally applicable and has been applied successfully for the case of structure/function recognition in the DUF families. In the next part of the thesis, a method for prediction of protein-protein interactions between a host and pathogen organism and its application to three groups of pathogens is presented. Chapter 6 describes the development of a procedure for prediction of protein-protein interactions (PPI) between a pathogen and its host organism. In the past, prediction of PPI has been attempted for proteins of a given organism. This was often approached by identifying proteins of the organism of interest that are homologous to two interacting proteins of another organism. A study of conservation of interactions as a function of sequence identity has been made in the past by various groups, which reveal that homologues sharing a sequence identity greater than about 30% interact in similar way. 
This fact can be used, along with a high quality database of protein-protein interactions, to predict interactions between proteins of the same organism. The work done in this thesis is one of the first attempts at extending the idea to the prediction of interactions between two different organisms. Homology of proteins from a pathogen and its host to proteins which are known to interact with each other would suggest that the proteins from pathogen and host can interact. The feasibility of such an interaction occurring under in vivo conditions needs to be addressed for biologically meaningful predictions. These issues have been dealt with in this part of the thesis. One of the main steps in the procedure for the prediction of PPI is identification of homologues of pathogen and host proteins to interacting proteins listed in PPI databases. Two template PPI databases have been used in this work. One of the databases is the DIP database which provides a list of interactions based on genome-scale yeast-two-hybrid data or small scale experiments. The other database used is the iPfam database which provides interaction templates (Pfam families) based on protein complexes of known structure present in the Protein Data Bank (PDB). Thus, the two databases are both comprehensive and are of high quality. The search for homologues in the DIP database was made using PSI-BLAST with stringent cutoffs for various parameters to minimize false positives. The search in iPfam database is done using RPS-BLAST and MulPSSM using stringent cutoffs. The cutoffs for the searches were fixed based on an assessment of conservation of putative interacting residues in the host and pathogen proteins as compared to the protein complexes of known structure. The predictions made are analyzed manually to assess the importance to the pathogenesis of the disease under consideration. 
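The template-based transfer step described above can be illustrated with a minimal sketch. It assumes the homology searches have already been run and summarized as dictionaries from template proteins to their pathogen or host homologues; the function name, argument names and data layout are hypothetical, and the later filtering by cutoffs, expression and localization is not shown.

```python
def predict_interactions(template_pairs, pathogen_homologs, host_homologs):
    """Transfer each known template interaction to pathogen-host protein pairs.

    template_pairs: iterable of (template_a, template_b) known to interact.
    pathogen_homologs / host_homologs: dict mapping a template protein to the
    pathogen (or host) proteins found homologous to it (assumed precomputed).
    """
    predicted = set()
    for a, b in template_pairs:
        # A template interaction is unordered, so try both orientations.
        for x, y in ((a, b), (b, a)):
            for pathogen_protein in pathogen_homologs.get(x, ()):
                for host_protein in host_homologs.get(y, ()):
                    predicted.add((pathogen_protein, host_protein))
    return predicted
```

Returning a set means a pathogen-host pair supported by several templates is reported once.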
In this chapter, in order to obtain an idea about robustness of this approach, PPI prediction was made for the phage-bacteria system and the herpes virus – human system which have been experimentally studied extensively and hence opportunities exist to compare the “predictions” with experimental results. The prediction of phage – bacteria interactions suggests that the gross biological features of the pathogenesis have been captured in the predictions. The GO (Gene Ontology) based annotations for the bacterial proteins predicted to interact suggests that the predictions involve proteins participating in DNA replication and protein synthesis. Many of the known interactions such as between the lambda phage repressor and RecA protein of bacteria were also ‘predicted’ in the analysis. A few novel interactions were predicted. For example, an interaction between a tail component protein and a protein of unknown function, YeeJ in E.coli, has been predicted. The prediction of interactions between Herpes Virus 8 and human host and its comparison to a set of experimentally verified interactions reported in literature suggested that close to 50% of the known interactions were ‘predicted’ by the procedure followed. A few novel cases of interaction between the viral proteins and the p53 protein have also been predicted which might help in understanding the tumorigenesis of the viral disease. A comparison between the procedure followed in this thesis and the results from another genome-scale method (proposed by Andrej Sali and coworkers) suggests that although the proteins involved in predicted interactions from two methods may differ, the functions of the proteins concerned suggested by GO annotations are highly correlated (greater than 98%). In the next few chapters, the prediction of interactions for different host-pathogen systems is described. In Chapter 7, the prediction of PPI between a Eukaryotic malarial pathogen, P.falciparum and its human host is described. 
The malarial parasite was chosen because of the extensive work reported in the literature on this pathogen in the recent years. Also, the gene expression patterns in the pathogen are highly correlated to the human tissue types with each stage of the pathogen occurring in a distinct tissue type. Thus, the biological context of the PPI can be explicitly assessed, which makes this example a well suited case for the procedure described in the Chapter 6 of this thesis. The pathogen is important from a medical perspective since there has been a recent emergence of P.falciparum induced malaria which is unresponsive to conventional drugs. Thus, studies of this parasite have gained an importance in the post genomic era. The difficulty in identifying homologues of many of the P.falciparum proteins makes this a challenging case study. Prediction of PPI between the malarial parasite and the human proteins has been approached in the same way as described in Chapter 6, with the cutoffs in homology searches kept stringent. However, in this case effective use of available additional biological data has been possible. The tissue specific expression information for human proteins has been obtained from the Atlas of Human transcriptome, and the NCBI GEO database. The pathogen stage-specific expression data has been obtained from multiple genome-scale experiments reported in the literature. The subcellular localization of both human and pathogen proteins has been predicted and hence this information is given low weightage in subsequent analysis. The prediction of PPI between malarial parasite and human, resulted in a total of more than 30,000 interactions which were compatible in an in vivo condition according to the expression data. Further reduction in the set of predicted interactions was made by incorporating the subcellular localization predictions (reduced to around 2000 interactions). 
Manual analysis of each of these interactions taking aid from literature on malarial parasites reveals that many of the known PPI are also ‘predicted’ in the analysis such as the interaction between SSP2 protein of P.falciparum and human ICAMs. For many proteins known to be important for pathogenesis, such as the RESA antigen, novel interactions were predicted that could help in better understanding of the pathogen. For some of the novel predicted interactions, such as that between the parasite Plasmepsin and human Spectrin, there exists circumstantial experimental evidence of interaction. Among many other novel interactions, the procedure used could predict interactions for 441 ‘hypothetical proteins’ of unknown function coded in the genome of the pathogen. The comprehensive list of predictions made using the procedure and an exploration of its biological significance can lead to novel hypotheses regarding the pathogenesis of malaria and hence the work presented in this chapter can be helpful for further experimental exploration of the pathogen. The success of the procedure in predicting known interactions as well as novel interactions in a Eukaryotic pathogen suggests that the procedure developed is generally applicable. However it must be pointed out that in many cases of host-pathogen systems, such extensive expression and localization data may not be available, which makes the analysis difficult due to the large number of interactions predicted. One such difficult case is that of the interactions between Mycobacterial species and the human host, which is described in the next chapter. Chapter 8 describes the prediction of PPI between human and M.tuberculosis as well as three pathogens closely related to M.tuberculosis. Each of the pathogens has been seen to re-emerge due to drug resistance and other causes. M.tuberculosis is becoming a global problem due to the limited number of drugs available to treat TB, which is susceptible to resistance. 
M.leprae has also shown signs of emergence of drug resistance, whereas C.diphtheriae, another pathogen studied in this chapter, is seen as an emerging pathogen in Eastern Europe and in the Indian subcontinent. Nocardial infections have also seen a rise due to the prevalence of AIDS which leads to susceptibility to the Nocardia infections. Thus, there is a need to understand further the pathogens in this important family, in order to better direct drug development. An important area for such endeavors is the mapping of the PPI between the pathogens and the human host. The procedure developed as part of the thesis can be used to predict such interactions. The procedure for prediction of interactions is the same as followed in Chapter 6 and involves identifications of homologues for the pathogen and host proteins among the proteins listed in the two template datasets DIP and iPfam using PSI-BLAST and RPS-BLAST (MulPSSM). In addition to the homology to the proteins involved in PPI, information / prediction on subcellular localization is used to assess biological significance of the interaction. An experimentally derived dataset of exported proteins in M.tuberculosis was used to supplement the predictions from PSORTb database that provides subcellular localization for bacterial proteins. In order to minimize the number of predictions explored manually and to maximize the biological relevance of predicted interactions, the predictions were made only for proteins present on the membrane of the pathogen or which are exported into the host. Prediction of interactions between human proteins and the proteins of four pathogens studied revealed that some of the interactions which were known from earlier experiments were “predicted” by the present procedure. For example, the M.leprae exported Serine protease is known to interact with Ras-like proteins in the human host, and this interaction was ‘predicted’. 
Among other predicted interactions, several novel interactions have been suggested for proteins important for pathogenesis, such as the MPT70 protein of M.tuberculosis, which has been predicted to interact with TGFβ-associated proteins, an interaction that could play an important role in the pathogenesis of the disease. Some human proteins are known to play an important role in pathogenesis, especially the toll-like receptors. A C.diphtheriae protein, Mycosin, has been predicted to interact with the toll-like receptors, raising the possibility that Mycosins may play an important role in pathogenesis. Several hypothetical proteins of unknown function in the pathogens have been predicted to interact with human proteins. A few such cases from M.tuberculosis are described in the thesis; these proteins are predicted to interact with proteins involved in post-translational modification in the human host. The prediction of novel interactions along with known interactions in four bacterial species thus suggests that the procedure can be used for almost any host-pathogen pair. In the next chapter, the application of the method to three other bacterial species belonging to the Enterobacteriaceae family is presented. Chapter 9 describes the analysis performed on the predicted interactions between human and three pathogens in the Enterobacteriaceae family, namely E.coli, S.enterica and Y.pestis. Each of these pathogens causes severe disease in the human host. Plague is caused by Y.pestis, and although the etiology of plague is different from that of the diseases caused by E.coli or S.enterica, the genomes of these organisms are closely related. It is known that Y.pestis evolved from an ancestor that probably caused dysentery in mammals, and that the bubonic form of plague evolved recently. Thus, it is useful to get clues to protein-protein interactions between these pathogens and the human host.
The procedure followed for the analysis is similar to that followed in Chapter 8, and involves detection of homology between the pathogen and host proteins and the proteins involved in interactions listed in the template databases. The predicted subcellular localization of the pathogen proteins has been obtained from the PSORTb database, and predictions are made for the human proteins. The pathogens are known to form a type III secretion system and export cytosolic proteins into the host, and thus no protein can be conclusively removed from the analysis on the basis of subcellular localization information. For all three pathogens, interactions known to occur with human proteins have been ‘predicted’ in the analysis presented in the chapter. Some of the known interactions involve the proteins SptP, a phosphatase from S.enterica; Ecotin, a protease inhibitor from E.coli; and YpkA, a Ser/Thr protein kinase from Y.pestis. Apart from these important virulence proteins, interactions have been predicted between pathogen proteins and human TNF-associated proteins, as well as human toll-like receptors. Thus, many novel interactions that are found to be biologically meaningful have been predicted using the procedure. Apart from novel interactions for proteins important for pathogenesis, many novel interactions involving ‘hypothetical proteins’ of the pathogens, which have a suitable biological context, have also been predicted. During the course of manual analysis, these interactions were found to be significant for pathogenesis. The main outcomes of the entire thesis work are summarized in Chapter 10, which places the work in the larger context of computational biology and its importance in the post-genomic era. The development of algorithms for remote homology detection and their subsequent application to function/structure prediction is highlighted.
The second part of the thesis, which documents the development and application of a method for prediction of PPI between host and pathogen organisms, is an important step forward for the exploration of pathogen biology. Supplementary information that is helpful for understanding each individual chapter, but which could not be printed in the thesis due to its length, is given on an optical disk attached to this thesis. The material provided on the optical disk is referred to at appropriate places in the individual chapters.
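
The homology-transfer step summarized in this abstract can be illustrated with a minimal sketch. It assumes that homology searches (for example with PSI-BLAST or RPS-BLAST) have already been run and reduced to simple lookup tables; the function name, the table layout and the toy identifiers below are hypothetical and are not taken from the thesis.

```python
from itertools import product


def predict_interactions(template_pairs, pathogen_homologs, host_homologs,
                         localization, allowed=("membrane", "exported")):
    """Transfer template interactions to pathogen-host protein pairs.

    template_pairs    : iterable of (template_a, template_b) known interactions
    pathogen_homologs : dict, template protein -> pathogen proteins homologous to it
    host_homologs     : dict, template protein -> host proteins homologous to it
    localization      : dict, pathogen protein -> localization label (assumed labels)
    allowed           : labels kept by the filter; None disables filtering
    """
    predictions = set()
    for a, b in template_pairs:
        # A template interaction is undirected, so try both orientations.
        for t_pathogen, t_host in ((a, b), (b, a)):
            candidates = product(pathogen_homologs.get(t_pathogen, ()),
                                 host_homologs.get(t_host, ()))
            for pathogen_protein, host_protein in candidates:
                if allowed is None or localization.get(pathogen_protein) in allowed:
                    predictions.add((pathogen_protein, host_protein))
    return predictions


# Toy data (hypothetical identifiers).
template_pairs = [("T1", "T2"), ("T3", "T4")]
pathogen_homologs = {"T1": ["pathA"], "T3": ["pathB"]}
host_homologs = {"T2": ["hostX"], "T4": ["hostY"]}
localization = {"pathA": "exported", "pathB": "cytoplasmic"}

filtered = predict_interactions(template_pairs, pathogen_homologs,
                                host_homologs, localization)
unfiltered = predict_interactions(template_pairs, pathogen_homologs,
                                  host_homologs, localization, allowed=None)
```

Passing `allowed=None` mirrors the situation described for Chapter 9, where type III secretion means that no protein can be excluded on the basis of localization alone.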

29

Bhar, Anirban. "Application of A Novel Triclustering Method in Analyzing Three Dimensional Transcriptomics Data." Doctoral thesis, 2015. http://hdl.handle.net/11858/00-1735-0000-0022-602C-1.

30

Liu, Yu. "A phylogenomics approach to resolving fungal evolution, and phylogenetic method development." Thèse, 2009. http://hdl.handle.net/1866/5096.

Abstract:

Bien que les champignons soient régulièrement utilisés comme modèle d'étude des systèmes eucaryotes, leurs relations phylogénétiques soulèvent encore des questions controversées. Parmi celles-ci, la classification des zygomycètes reste inconsistante. Ils sont potentiellement paraphylétiques, i.e. regroupent de lignées fongiques non directement affiliées. La position phylogénétique du genre Schizosaccharomyces est aussi controversée: appartient-il aux Taphrinomycotina (précédemment connus comme archiascomycetes) comme prédit par l'analyse de gènes nucléaires, ou est-il plutôt relié aux Saccharomycotina (levures bourgeonnantes) tel que le suggère la phylogénie mitochondriale? Une autre question concerne la position phylogénétique des nucléariides, un groupe d'eucaryotes amiboïdes que l'on suppose étroitement relié aux champignons. Des analyses multi-gènes réalisées antérieurement n'ont pu conclure, étant donné le choix d'un nombre réduit de taxons et l'utilisation de six gènes nucléaires seulement. Nous avons abordé ces questions par le biais d'inférences phylogénétiques et tests statistiques appliqués à des assemblages de données phylogénomiques nucléaires et mitochondriales. D'après nos résultats, les zygomycètes sont paraphylétiques (Chapitre 2) bien que le signal phylogénétique issu du jeu de données mitochondriales disponibles est insuffisant pour résoudre l'ordre de cet embranchement avec une confiance statistique significative. Dans le Chapitre 3, nous montrons à l'aide d'un jeu de données nucléaires important (plus de cent protéines) et avec supports statistiques concluants, que le genre Schizosaccharomyces appartient aux Taphrinomycotina. 
De plus, nous démontrons que le regroupement conflictuel des Schizosaccharomyces avec les Saccharomycotina, venant des données mitochondriales, est le résultat d'un type d'erreur phylogénétique connu: l'attraction des longues branches (ALB), un artéfact menant au regroupement d'espèces dont le taux d'évolution rapide n'est pas représentatif de leur véritable position dans l'arbre phylogénétique. Dans le Chapitre 4, en utilisant encore un important jeu de données nucléaires, nous démontrons avec support statistique significatif que les nucleariides constituent le groupe lié de plus près aux champignons. Nous confirmons aussi la paraphylie des zygomycètes traditionnels tel que suggéré précédemment, avec support statistique significatif, bien que ne pouvant placer tous les membres du groupe avec confiance. Nos résultats remettent en cause des aspects d'une récente reclassification taxonomique des zygomycètes et de leurs voisins, les chytridiomycètes. Contrer ou minimiser les artéfacts phylogénétiques telle l'attraction des longues branches (ALB) constitue une question récurrente majeure. Dans ce sens, nous avons développé une nouvelle méthode (Chapitre 5) qui identifie et élimine dans une séquence les sites présentant une grande variation du taux d'évolution (sites fortement hétérotaches - sites HH); ces sites sont connus comme contribuant significativement au phénomène d'ALB. Notre méthode est basée sur un test de rapport de vraisemblance (likelihood ratio test, LRT). Deux jeux de données publiés précédemment sont utilisés pour démontrer que le retrait graduel des sites HH chez les espèces à évolution accélérée (sensibles à l'ALB) augmente significativement le support pour la topologie « vraie » attendue, et ce, de façon plus efficace comparée à d'autres méthodes publiées de retrait de sites de séquences. Néanmoins, et de façon générale, la manipulation de données préalable à l'analyse est loin d’être idéale. 
Les développements futurs devront viser l'intégration de l'identification et la pondération des sites HH au processus d'inférence phylogénétique lui-même.
Despite the popularity of fungi as eukaryotic model systems, several questions on their phylogenetic relationships continue to be controversial. These include the classification of zygomycetes, which are potentially paraphyletic, i.e. a combination of several not directly related fungal lineages. The phylogenetic position of Schizosaccharomyces species has also been controversial: do they belong to Taphrinomycotina (previously known as archiascomycetes) as predicted by analyses with nuclear genes, or are they instead related to Saccharomycotina (budding yeasts) as in mitochondrial phylogenies? Another question concerns the precise phylogenetic position of nucleariids, a group of amoeboid eukaryotes that are believed to be close relatives of Fungi. Previously conducted multi-gene analyses have been inconclusive because of limited taxon sampling and the use of only six nuclear genes. We have addressed these issues by assembling phylogenomic nuclear and mitochondrial datasets for phylogenetic inference and statistical testing. According to our results, zygomycetes appear to be paraphyletic (Chapter 2), but the phylogenetic signal in the available mitochondrial dataset is insufficient for resolving their branching order with statistical confidence. In Chapter 3 we show with a large nuclear dataset (more than 100 proteins) and conclusive statistical support that Schizosaccharomyces species are part of Taphrinomycotina. We further demonstrate that the conflicting grouping of Schizosaccharomyces with budding yeasts, obtained with mitochondrial sequences, results from a phylogenetic error known as long-branch attraction (LBA, a common artifact that leads to the regrouping of species with high evolutionary rates irrespective of their true phylogenetic positions). In Chapter 4, again using a large nuclear dataset, we demonstrate with significant statistical support that nucleariids are the closest known relatives of Fungi.
We also confirm the paraphyly of traditional zygomycetes as previously suggested, with significant support, but without placing all members of this group with confidence. Our results question aspects of a recent taxonomical reclassification of zygomycetes and their chytridiomycete neighbors (a group of zoospore-producing Fungi). Overcoming or minimizing phylogenetic artifacts such as LBA has been among our most recurring questions. We have therefore developed a new method (Chapter 5) that identifies and eliminates sequence sites with highly uneven evolutionary rates (highly heterotachous sites, or HH sites) that are known to contribute significantly to LBA. Our method is based on a likelihood ratio test (LRT). Two previously published datasets are used to demonstrate that gradual removal of HH sites in fast-evolving species (those susceptible to LBA) significantly increases the support for the expected ‘true’ topology, and does so more effectively than comparable published methods of sequence site removal. Yet in general, data manipulation prior to analysis is far from ideal. Future development should aim at integrating HH site identification and weighting into the phylogenetic inference process itself.
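
As a generic illustration of a likelihood ratio test used to flag sites, the sketch below compares a per-site log-likelihood under a null model with that under a richer alternative. It is not the thesis's method: the abstract does not give the models or the degrees of freedom, so the code assumes nested models differing by one free parameter (a chi-square reference distribution with 1 degree of freedom), and the function names and toy values are hypothetical.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Return (statistic, p_value) for nested models differing by one parameter.

    Assumption: the statistic 2 * (lnL_alt - lnL_null) follows a chi-square
    distribution with 1 degree of freedom under the null model.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def flag_sites(site_lnls, alpha=0.05):
    """Return indices of sites whose alternative model fits significantly better.

    site_lnls : list of (lnL under null, lnL under alternative), one per site
    """
    return [i for i, (lnl_null, lnl_alt) in enumerate(site_lnls)
            if likelihood_ratio_test(lnl_null, lnl_alt)[1] < alpha]


# Toy per-site log-likelihoods (hypothetical values).
site_lnls = [(-10.0, -9.5), (-12.0, -8.0)]
flagged = flag_sites(site_lnls)
```

In this toy input only the second site shows a large enough likelihood gain to be flagged at the 5% level; a stricter `alpha` removes fewer sites, which corresponds loosely to the gradual removal described above.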
