Dissertations / Theses on the topic 'Next Generation Sequencing'


1

Espach, Yolandi. "The detection of mycoviral sequences in grapevine using next-generation sequencing." Thesis, Stellenbosch : Stellenbosch University, 2013. http://hdl.handle.net/10019.1/80025.

Abstract:
Thesis (MSc)--Stellenbosch University, 2013.
ENGLISH ABSTRACT: Metagenomic studies that make use of next-generation sequencing (NGS) generate large amounts of sequence data, representing the genomes of multiple organisms of which no prior knowledge is necessarily available. In this study, a metagenomic NGS approach was used to detect multiple novel mycoviral sequences in grapevine phloem tissue. Individual sequencing libraries of double-stranded RNA (dsRNA) from two grapevine leafroll diseased (GLD) and three shiraz diseased (SD) vines were sequenced using an Illumina HiScanSQ instrument. Over 3.2 million reads were generated from each of the samples, and these reads were trimmed and filtered for quality before being de novo assembled into longer contigs. The assembled contigs were subjected to BLAST (Basic Local Alignment Search Tool) analyses against the NCBI (National Center for Biotechnology Information) database and classified according to the database sequences with which they had the highest identity. Twenty-six putative mycovirus species were identified, belonging to the families Chrysoviridae, Endornaviridae, Narnaviridae, Partitiviridae and Totiviridae. Two of the identified mycoviruses, namely grapevine-associated chrysovirus (GaCV) and grapevine-associated mycovirus 1 (GaMV-1), have previously been identified in grapevine, while the rest appeared to be novel mycoviruses not present in the NCBI database. Primers were designed from the de novo assembled mycoviral sequences and used to screen the grapevine dsRNA used for sequencing as well as endophytic fungi isolated from the five sample vines. Only two mycoviruses, related to sclerotinia sclerotiorum partitivirus S and chalara elegans endornavirus 1 (CeEV-1), could be detected in grapevine dsRNA and in fungus isolates. In order to validate the presence of mycoviruses in grapevine phloem tissue, two additional sequencing runs, using an Illumina HiScanSQ and an Applied Biosystems (ABI) SOLiD 5500xl instrument respectively, were performed. 
These runs generated more and higher quality sequence data than the first sequencing run. Twenty-two of the putative mycoviral sequences initially detected were detected in the subsequent sequence datasets, as well as an additional 29 species not identified in the first HiScanSQ sequence datasets. The samples harboured diverse mycovirus populations, with as many as 19 putative species identified in a single vine. This indicates that the complete virome of diseased grapevines will include a high number of mycoviruses. Additionally, the complete genome of a novel endornavirus, for which we propose the name grapevine endophyte endornavirus (GEEV), was assembled from one of the second HiScanSQ sequence datasets. This is the first complete genome of a mycovirus detected in grapevine. Grapevine endophyte endornavirus has the highest sequence similarity to CeEV-1 and is the same virus that was previously detected in fungus isolates using the mycovirus primers. The virus was detected in two fungus isolates, namely Stemphylium sp. and Aureobasidium pullulans, which is of interest since mycoviruses are not known to be naturally associated with two distinctly different fungus genera. Mycoviral sequence data generated in this study can be used to further investigate the diversity and the effect of mycoviruses in grapevine.
AFRIKAANS SUMMARY (translated): Metagenomic studies that make use of next-generation sequencing technology are able to determine the genetic composition of multiple unknown organisms because they generate large amounts of data. In this study, this technique was applied to identify a number of new mycoviruses in the phloem tissue of grapevine. Double-stranded RNA was purified from two vines with leafroll disease and three with shiraz disease, and an Illumina HiScanSQ instrument was used to generate more than 3.2 million sequence fragments from each of the samples. Low-quality sequences were removed and the remaining short sequence fragments were assembled into longer constructs that could be identified by BLAST searches against the NCBI database. Twenty-six mycovirus species, belonging to the families Chrysoviridae, Endornaviridae, Narnaviridae, Partitiviridae and Totiviridae, were identified. Two of the identified mycoviruses, namely grapevine-associated chrysovirus (GaCV) and grapevine-associated mycovirus 1 (GaMV-1), had previously been found in grapevine, while the rest are new mycoviruses not currently present in the NCBI database. Primers were designed from the assembled mycovirus sequences and used to test grapevine double-stranded RNA, as well as fungi isolated from the vines, for the presence of these mycoviruses. Only two mycoviruses, related respectively to sclerotinia sclerotiorum partitivirus S and chalara elegans endornavirus 1 (CeEV-1), could be identified in grapevine and fungal isolates using the primers. Two additional sequencing runs, using the Illumina HiScanSQ and ABI SOLiD 5500xl sequencing platforms, were used to confirm the presence of mycoviruses in grapevine. 
A larger number of sequence fragments was produced, and these were also of higher quality than those of the first sequencing run. Twenty-two mycovirus species could be identified again, as well as 29 species that had not been found in the first HiScanSQ sequence datasets. The vines examined in this study contained a high diversity of mycoviruses, since up to 19 mycovirus species were identified in a single vine. This indicates that complete virus profiles of diseased vines will include a number of mycoviruses. The full-length genome sequence of a previously unknown endornavirus was assembled from one of the second HiScanSQ sequence datasets. This is the first mycovirus found in grapevine for which the complete genome sequence has been determined, and we propose the name grapevine endophyte endornavirus (GEEV) for this virus. Grapevine endophyte endornavirus is most closely related to CeEV-1 and is the same virus that was previously found in grapevine double-stranded RNA and fungal isolates using the mycovirus primers. The fungal isolates in which GEEV was found were identified as Stemphylium sp. and Aureobasidium pullulans. It is noteworthy that GEEV was found in two fungal isolates belonging to different genera, since this phenomenon has not previously been found in nature. Mycovirus nucleic acid sequences determined in this study can be used in future studies to further investigate the diversity and impact of mycoviruses in grapevine.
National Research Foundation (NRF)
Stellenbosch University
2

TROVÃO, Nídia Isabel Sequeira. "Evaluation of next generation sequency protocols for VIH complete genome sequencing." Master's thesis, Instituto de Higiene e Medicina Tropical, 2011. http://hdl.handle.net/10362/51111.

Abstract:
Portuguese abstract (translated): Human immunodeficiency virus (HIV) is a retrovirus that gave rise to a pandemic after zoonotic transmission in the first half of the twentieth century. Current therapy, known as highly active antiretroviral therapy, can significantly delay disease progression. However, despite more than 25 years of intensive research, no cure is yet available. All available antiretroviral drugs face the challenge posed by the high evolutionary potential of HIV. This implies that, regardless of the drug cocktail administered, resistance to it can and will develop. To manage these negative effects, patients should be monitored regularly in order to detect the development of drug resistance early, so that the therapy regimen can be adjusted in good time. Notably, resistant strains, whether they evolved de novo or were acquired through transmission, can have a negative impact on therapy outcome. Patients who have never received therapy should therefore also be assessed before it begins. This screening usually involves genotyping the viral population by direct sequencing of RT-PCR products. Unfortunately, this approach does not allow reliable detection of viral strains present in less than 20% to 25% of the population. The association between minority populations encoding drug resistance and therapy failure has driven investigations into the Roche® 454 platform in an attempt to gain more precise and in-depth knowledge of the viral population. However, such studies are limited to certain genomic regions, and the fragmentation procedures applied on the Roche® 454 platform require large amounts of starting material. 
This thesis forms part of a broader project comparing the most recent sample pre-processing protocols for complete genome sequencing of HIV from clinical plasma and peripheral blood mononuclear cell samples, with the identification of the most suitable reservoir for resistance detection in recently infected patients as a second objective. This research work therefore focuses on the practical aspects of sample pre-processing prior to the generation of sequence data. In detail, all laboratory procedures for both the sequence-specific and the random-sequence amplification strategies were carried out. For the former, we generated 6 overlapping amplicons covering the entire HIV-1 genome. After equimolar mixing of all amplicons for each sample, two enzymatic fragmentation methods were performed. These will be compared with the conventional mechanical fragmentation method used by Roche® 454. The successful sequencing of one sample and the completion of all pre-processing procedures are promising for other applications, but a comprehensive evaluation of the sequencing data to be generated is necessary to make an informed choice among the different approaches.
Human immunodeficiency virus (HIV) is a retrovirus that gave rise to a worldwide epidemic after its successful zoonotic transmission in the first half of the twentieth century. Current therapy, referred to as Highly Active AntiRetroviral Therapy (HAART), can significantly delay disease progression. However, despite more than 25 years of intensive research, there is still no cure available. All available antiretroviral drugs are faced with the insurmountable challenge posed by the high evolutionary potential of HIV. This implies that, regardless of the administered drug cocktail, drug resistance can and will develop. To manage these negative effects, patients should be screened on a regular basis in order to detect the development of drug resistance at an early phase, so that the therapy regimen can be adjusted in time. Importantly, drug-resistant variants, whether they evolved de novo or were acquired through transmission, can negatively impact therapy outcome. Thus, therapy-naive patients should also be screened before therapy onset. This screening usually involves genotyping of the viral population through direct sequencing of the RT-PCR products. Unfortunately, this approach does not allow the reliable detection of viral variants present in less than about 20%-25% of the population. The association of such minor variants harboring drug resistance mutations with therapy failure fueled investigations to exploit the recently developed Roche® 454 NGS platform in an attempt to gain a more accurate in-depth view of the viral population. These inquiries are characterized by two major drawbacks: their focus on limited genomic regions and the need for large amounts of input material characteristic of the proprietary Roche® 454 fragmentation approach. 
As part of a larger project on the comparison of currently available sample preprocessing protocols for complete genome sequencing of clinical HIV plasma and PBMC samples, and the identification of the most suitable viral reservoir for resistance testing in newly infected patients as a secondary objective, this thesis focuses on the corresponding practical aspects of pre-processing prior to sequence data generation. Specifically, all wet-lab procedures for both the sequence-specific and random priming amplification strategies were carried out. For the former, we generated 6 overlapping amplicons to cover the entire HIV-1 genome. After equimolar pooling of all amplicons for each sample, we performed two enzymatic fragmentation methods. These will be compared to conventional mechanical 454 shearing. The successful sequencing of one sample and the completion of all sample pre-processing procedures is promising for further applications but a comprehensive evaluation of the sequence data to be generated is necessary to make an informed choice among the different approaches.
3

Sundquist, Andreas. "Algorithms for next-generation sequencing /." May be available electronically:, 2008. http://proquest.umi.com/login?COPT=REJTPTU1MTUmSU5UPTAmVkVSPTI=&clientId=12498.

4

Espírito, Ana Cláudia Pereira. "Saccharomycotin transcriptomics by next-generation sequencing." Master's thesis, Universidade de Aveiro, 2015. http://hdl.handle.net/10773/15677.

Abstract:
Master's in Molecular Biomedicine
The non-standard decoding of the CUG codon in Candida cylindracea raises a number of questions about the evolutionary process of this organism and of other species of the Candida clade for which the codon is ambiguous. In order to find some answers, we studied the transcriptome of C. cylindracea, comparing its behavior with that of Saccharomyces cerevisiae (standard decoder) and Candida albicans (ambiguous decoder). The transcriptome characterization was performed using RNA-seq. This approach has several advantages over microarrays and its application is booming. TopHat and Cufflinks were the software used to build the protocol that allowed for gene quantification. About 95% of the reads were mapped to the genome. A total of 3693 genes were analyzed, of which 1338 had a non-standard start codon (TTG/CTG), and the percentage of expressed genes was 99.4%. Most genes have intermediate levels of expression, some have little or no expression, and a minority is highly expressed. The distribution profile of the CUG codon differs between the three species, but it can be significantly associated with gene expression levels: genes with fewer CUGs are the most highly expressed. However, CUG content is not related to the conservation level: more and less conserved genes have, on average, an equal number of CUGs. The most conserved genes are the most expressed. The lipase genes corroborate the results obtained for most genes of C. cylindracea, since they are very rich in CUGs and not conserved at all. The reduced number of CUG codons observed in highly expressed genes may be due to an insufficient number of tRNA genes to cope with more CUGs without compromising translational efficiency. The enrichment analysis confirmed that the most conserved genes are associated with basic functions such as translation, pathogenesis and metabolism. Within this set, genes with more or fewer CUGs seem to have different functions. The key issues concerning the evolutionary phenomenon remain unclear. 
However, the results are consistent with previous observations and yield a variety of conclusions that should be taken into consideration in future analyses, since this was the first time that such a study was conducted.
Portuguese abstract (translated): The non-standard decoding of the CUG codon in Candida cylindracea raises a series of questions about the evolutionary process of this organism and of other species of the Candida clade for which the codon is ambiguous. To find some answers, the transcriptome of C. cylindracea was studied and its behaviour compared with that of Saccharomyces cerevisiae (standard decoder) and Candida albicans (ambiguous decoder). The transcriptome was characterized by RNA-seq. This methodology has several advantages over microarrays and its application is expanding rapidly. TopHat and Cufflinks were the software used to build the protocol that enabled gene quantification. About 95% of the reads aligned to the genome. A total of 3693 genes were analysed, 1338 of which had a non-standard start codon (TTG/CTG), and the percentage of the genome expressed was 99.4%. Most genes have intermediate expression levels, some show little or no expression, and a minority is highly expressed. The distribution profile of the CUG codon among the three species is very different, but it can be significantly associated with expression levels: genes with fewer CUGs are the most highly expressed. However, CUG content is not related to the level of conservation: more and less conserved genes have, on average, the same number of CUGs. The most conserved genes are the most expressed. The lipase genes corroborate the results obtained for C. cylindracea genes in general, being very rich in CUGs and not conserved at all. The reduced number of CUG codons observed in highly expressed genes may possibly be due to an insufficient number of tRNA genes to cope with more CUGs without compromising translation efficiency. 
The enrichment analysis confirmed that the most conserved genes are associated with basic functions such as translation, pathogenesis and metabolism. Within this set, genes with more and fewer CUGs appear to have different functions. The key questions about the evolutionary phenomenon remain to be clarified. Nevertheless, the results are compatible with previous observations, and several conclusions are presented that should be taken into consideration in future analyses, since this was the first time a study of this kind was carried out.
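The abstract above relates the number of CUG codons per gene to expression and conservation. As a rough illustration of how such a per-gene count might be obtained, the following minimal Python sketch counts in-frame occurrences of a codon in a coding sequence. The function names and the example sequence are hypothetical and are not taken from the thesis, whose actual protocol (built on TopHat and Cufflinks) is not described in enough detail here to reproduce.

```python
def count_codon(cds: str, codon: str = "CTG") -> int:
    """Count in-frame occurrences of `codon` in a coding sequence.

    RNA input is accepted: U is converted to T before counting,
    so "CUG" and "CTG" are treated as the same codon.
    """
    cds = cds.upper().replace("U", "T")
    codon = codon.upper().replace("U", "T")
    # Step through the sequence three bases at a time, starting at frame 0.
    return sum(1 for i in range(0, len(cds) - 2, 3) if cds[i:i + 3] == codon)


def codon_fraction(cds: str, codon: str = "CTG") -> float:
    """Fraction of complete codons in `cds` that equal `codon`."""
    n_codons = len(cds) // 3
    return count_codon(cds, codon) / n_codons if n_codons else 0.0


# Hypothetical toy sequence: ATG CTG CTG TAA
example = "ATGCTGCTGTAA"
print(count_codon(example), codon_fraction(example))
```

Counting only in frame matters here: a CTG that straddles two codons is not decoded as CUG and should not contribute to the per-gene total.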
5

Kumar, Sujai. "Next-generation nematode genomes." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/7609.

Abstract:
The first metazoan to be sequenced was a nematode (Caenorhabditis elegans), and understanding the genome of this model organism has led to many insights about all animals. Although eleven nematode genomes have been published so far and approximately twenty more are under way, the vast majority of the genomes of this incredibly diverse phylum remain unexplored. Next-generation sequencing has made it possible to generate large amounts of genome sequence data in a few days at a fraction of the cost of traditional Sanger-sequencing. However, assembling and annotating these data into genomic resources remains a challenge because of the short reads, the quality issues in these kinds of data, and the presence of contaminants and co-bionts in uncultured samples. In this thesis, I describe the process of creating high quality draft genomes and annotation resources for four nematode species representing three of the five major nematode clades: Caenorhabditis sp. 5, Meloidogyne floridensis, Dirofilaria immitis, and Litomosoides sigmodontis. I describe the new approaches I developed for visualising contamination and co-bionts, and I present the details of the robust workflow I devised to deal with the problems of generating low-cost genomic resources from Illumina short-read sequencing. Results: The draft genome assemblies created using the workflow described in this thesis are comparable to the draft nematode genomes created using Sanger sequencing. Armed with these genomes, I was able to answer two evolutionary genomics questions at very different scales. The first question was whether any non-coding elements were deeply conserved at the level of the whole phylum. Such elements had previously been hypothesised to be responsible for the phylum body plan in vertebrates, insects, and nematodes. I used twenty nematode genomes in several whole-genome alignments and concluded that no such elements were conserved across the whole phylum. 
The second question addressed the origins of the highly destructive plant-parasitic root-knot nematode Meloidogyne incognita. Comparisons with the newly sequenced Meloidogyne floridensis genome revealed the complex hybrid origins of both species, undermining previous assumptions about the rarity of hybrid speciation in animals. Conclusions: This thesis demonstrates the role of next-generation sequencing in democratising genome sequencing projects. Using the sequencing strategies, workflows, and tools described here, one can rapidly create genomic resources at a very low cost, even for unculturable metazoans. These genomes can be used to understand the evolutionary history of a genus or a phylum, as shown.
6

Qiao, Dandi. "Statistical Approaches for Next-Generation Sequencing Data." Thesis, Harvard University, 2012. http://dissertations.umi.com/gsas.harvard:10689.

Abstract:
During the last two decades, genotyping technology has advanced rapidly, which enabled the tremendous success of genome-wide association studies (GWAS) in the search for disease susceptibility loci (DSLs). However, only a small fraction of the overall predicted heritability can be explained by the DSLs discovered. One possible explanation for this "missing heritability" phenomenon is that many causal variants are rare. The recent development of high-throughput next-generation sequencing (NGS) technology provides the instrument to look closely at these rare variants with precision and efficiency. However, new approaches for both the storage and analysis of sequencing data are urgently needed. In this thesis, we introduce three methods that could be utilized in the management and analysis of sequencing data. In Chapter 1, we propose a novel and simple algorithm for compressing sequencing data that leverages the scarcity of rare variants in the data, which enables efficient storage and analysis of sequencing data in current hardware environments. We also provide a C++ implementation that supports direct and parallel loading of the compressed format without requiring extra time for decompression. Chapters 2 and 3 focus on the association analysis of sequencing data in population-based designs. In Chapter 2, we present a statistical methodology that allows the identification of genetic outliers to obtain a genetically homogeneous subpopulation, which reduces the false positives due to population substructure. Our approach is computationally efficient, can be applied to all the genetic loci in the data, and does not require pruning of variants in linkage disequilibrium (LD). In Chapter 3, we propose a general analysis framework in which thousands of genetic loci can be tested simultaneously for association with complex phenotypes. 
The approach is built on spatial-clustering methodology, assuming that genetic loci that are associated with the target phenotype cluster in certain genomic regions. In contrast to standard methodology for multi-loci analysis, which has focused on the dimension reduction of data, the proposed approach profits from the availability of large numbers of genetic loci. Thus it will be especially relevant for whole-genome sequencing studies which commonly record several thousand loci per gene.
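The abstract does not describe the compression algorithm of Chapter 1, so the following Python sketch is not that algorithm. It only illustrates the general idea the abstract points to: when a variant is rare, almost every sample carries the reference genotype, so storing only the non-reference entries is much smaller than storing the full vector. The genotype coding (0, 1 or 2 copies of the minor allele) and all function names are assumptions made for this illustration.

```python
from typing import Dict, List


def compress_genotypes(genotypes: List[int]) -> Dict[int, int]:
    """Keep only non-reference genotypes as {sample_index: minor_allele_count}."""
    return {i: g for i, g in enumerate(genotypes) if g != 0}


def decompress_genotypes(sparse: Dict[int, int], n_samples: int) -> List[int]:
    """Rebuild the dense genotype vector from its sparse form."""
    dense = [0] * n_samples
    for i, g in sparse.items():
        dense[i] = g
    return dense


def minor_allele_count(sparse: Dict[int, int]) -> int:
    """Summary statistic computed directly on the sparse form, without decompressing."""
    return sum(sparse.values())


# Hypothetical rare variant observed in 6 samples: two carriers only.
dense = [0, 0, 1, 0, 2, 0]
sparse = compress_genotypes(dense)
print(sparse, minor_allele_count(sparse))
```

The last function shows why a sparse layout can also speed up analysis: statistics such as allele counts touch only the carriers, which is in the spirit of the abstract's remark about loading the compressed format without a separate decompression step.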
7

Iceton, Gregg. "Next generation sequencing for the water industry." Thesis, University of Newcastle upon Tyne, 2018. http://hdl.handle.net/10443/4187.

Abstract:
The wastewater industry uses biotechnology to ensure that the discharge of sewage does not have deleterious effects on the environment, yet knowledge of the underlying microbiology is poor. This leads to over-engineered and inefficient processes which occasionally and unexpectedly fail. Similarly, the impact of sewage on the microbiology of receiving waters is unclear. Recent developments in DNA sequencing have enabled its use where cost was previously prohibitive. I investigated two applications of Next Generation Sequencing (NGS): activated sludge process monitoring for nitrification, foaming and bulking, and microbial source tracking of faecal contamination in bathing waters. Samples from 32 activated sludge plants (ASPs) were collected and analysed. Cell-specific ammonia oxidation rates (CSAOR) were calculated using the equation $\mathrm{CSAOR} = \dfrac{A \times M \times 10^{6} \times r}{\mathrm{AOB} \times \mathrm{MLSS} \times V}$, where A = grams of ammonia oxidised, M = the number of moles of ammonia in a gram, r = correction factor of 0.9 due to some ammonia removal by adsorption and assimilation (Daims, Ramsing, et al. 2001), MLSS = mixed liquor suspended solids in mg/L and V = the volume of the aeration basin in litres. The CSAOR in nitrifying plants ranged from one to ten mmol/cell/hour, in agreement with other CSAOR studies using alternative techniques. Biological foaming in ASPs occurs when the abundance of filamentous bacteria with hydrophobic surface membranes becomes excessive, though the exact abundance threshold above which foaming occurs has not yet been established. The relative abundance of bacteria associated with foaming was measured for all ASPs, which were then categorised as non-foaming, occasionally-foaming or currently-foaming based on operator assessment. There was a significant difference in the abundance of foaming bacteria between non-foaming and occasionally-foaming plants (ANOVA p < 0.001), with all non-foaming plants having less than 1% relative abundance of foaming bacteria. 
These results demonstrate that NGS could be a useful ASP process monitoring tool. A bathing water catchment was sampled throughout a bathing season, including a storm event. Partial least squares analysis showed there was a significant correlation between faecal indicator bacteria and the cumulative apportioned fraction of sources (using Bayesian statistics) in the bathing water community (p < 0.001, r² = 87%). Faecal host marker analysis detected human contamination upstream of any wastewater network inputs, illustrating the impact of diffuse human pollution. Whole community analysis apportioned the bathing water microbial community to point and diffuse sources, and found that whilst human sources were dominant during storm conditions, in dry weather the primary source of faecal contamination was variable and in some cases could not be attributed to known faecal sources.
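The CSAOR calculation in this abstract can be sketched in Python as below. This is a hedged reading, not the thesis's code: the equation is garbled in the source, and the sketch assumes that the correction factor r multiplies the numerator together with A, M and 10^6, and that the AOB, MLSS and V terms form the denominator. The abstract does not define the AOB term; the sketch assumes it is an abundance of ammonia-oxidising bacteria expressed so that AOB × MLSS × V gives a total cell count. All parameter names and the numbers in the usage line are hypothetical.

```python
def csaor(a_grams: float, m_mol_per_gram: float, aob: float,
          mlss_mg_per_l: float, volume_l: float, r: float = 0.9) -> float:
    """Cell-specific ammonia oxidation rate under the assumed reading:

        CSAOR = (A * M * 10^6 * r) / (AOB * MLSS * V)

    a_grams        -- A, grams of ammonia oxidised
    m_mol_per_gram -- M, moles of ammonia per gram
    aob            -- AOB term (assumed: AOB abundance per mg of MLSS)
    mlss_mg_per_l  -- MLSS, mixed liquor suspended solids in mg/L
    volume_l       -- V, aeration basin volume in litres
    r              -- correction factor (0.9 in the abstract)
    """
    total_cells = aob * mlss_mg_per_l * volume_l
    if total_cells <= 0:
        raise ValueError("AOB, MLSS and V must all be positive")
    return a_grams * m_mol_per_gram * 1e6 * r / total_cells


# Hypothetical round numbers chosen only to make the arithmetic easy to follow.
print(csaor(a_grams=10, m_mol_per_gram=0.5, aob=2, mlss_mg_per_l=5, volume_l=10))
```

With these toy inputs the numerator is 10 × 0.5 × 10^6 × 0.9 and the denominator is 2 × 5 × 10, which makes it easy to check the placement of each term by hand.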
8

Odelgard, Anna. "Coverage Analysis in Clinical Next-Generation Sequencing." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-379228.

Abstract:
With the arrival of NGS, new tools had to be developed to work with new data formats, to handle data sets larger than those of previous techniques, and to check the accuracy of the data. Coverage analysis is an important quality control for NGS data: the coverage indicates how many times each base pair has been sequenced and thus how trustworthy each base call is. For clinical purposes, every base of interest must be quality controlled, as one wrong base call could affect the patient negatively. Software for coverage analysis with enough accuracy and detail for clinical applications is scarce. Several tools, such as Samtools, can calculate coverage values but do not process this information further to produce a QC report for each base pair of interest. My master's thesis has therefore been to create a new coverage analysis report tool, named CAR tool, that extracts the coverage values from Samtools and uses these data to produce a report consisting of tables, lists and figures. CAR tool was created to replace the currently used tool, ExCID, at the Clinical Genomics facility at SciLifeLab in Uppsala and was developed to meet the needs of the bioinformaticians and clinicians. CAR tool is written in Python and launched from a terminal window. The main function of the tool is to display coverage breadth values for each region of interest and to extract all subregions below a chosen coverage depth threshold. The low-coverage regions are then reported together with region name, start and stop positions, length and mean coverage value. To make the tool useful to as many users as possible, several settings can be selected by passing different flags when calling the tool. These settings include generating pie charts of each region's coverage values, filtering reads and bases by quality, and supplying a custom command that Samtools will use for the coverage calculation. 
The tool has been shown to find these low-coverage regions very well. Most low-coverage regions found are also found by ExCID; some differences did occur, however, and every such region was verified in IGV. The coverage values shown in IGV coincided with those found by CAR tool. CAR tool is written to find all low-coverage regions even if they are only one base pair long, whereas ExCID seems to generate larger low-coverage regions and does not take very short ones into account. To read more about the functions and how to use CAR tool, I refer to the User instructions in the appendix and on GitHub at the repository anod6351.
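The two quantities this abstract centres on, coverage breadth and low-coverage subregions, can be illustrated with a short Python sketch. It is not CAR tool's code and does not call Samtools; it assumes the per-base depths of one region of interest have already been read into a list (for example from the per-position output of `samtools depth`). Function names and the example numbers are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, float]  # (start, stop, length, mean depth)


def low_coverage_regions(depths: List[int], threshold: int, start: int = 1) -> List[Region]:
    """Return every run of consecutive bases whose depth is below `threshold`.

    Runs of a single base are reported too. `start` is the genomic
    coordinate of the first entry in `depths`; stops are inclusive.
    """
    regions: List[Region] = []
    run: List[int] = []
    run_start = start
    for offset, depth in enumerate(depths):
        pos = start + offset
        if depth < threshold:
            if not run:
                run_start = pos
            run.append(depth)
        elif run:
            regions.append((run_start, pos - 1, len(run), sum(run) / len(run)))
            run = []
    if run:  # close a run that reaches the end of the region
        regions.append((run_start, start + len(depths) - 1, len(run), sum(run) / len(run)))
    return regions


def coverage_breadth(depths: List[int], threshold: int) -> float:
    """Fraction of bases with depth at or above `threshold`."""
    return sum(d >= threshold for d in depths) / len(depths) if depths else 0.0


# Hypothetical region starting at position 100 with a 10x threshold.
depths = [30, 5, 8, 40, 2]
print(low_coverage_regions(depths, threshold=10, start=100))
print(coverage_breadth(depths, threshold=10))
```

Reporting one-base runs, rather than merging or dropping them, mirrors the behaviour the abstract attributes to CAR tool in its comparison with ExCID.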
9

Clifford, Harry William. "Next generation sequencing in disease-relevant tissues." Thesis, University of Oxford, 2016. https://ora.ox.ac.uk/objects/uuid:cf2eb0ac-62dd-41c7-896d-35f11f416b82.

Abstract:
Studies of RNA and the transcriptome are of great importance in providing functional information and unravelling the genetic mechanisms that underlie complex disorders and diseases. With the vast majority of complex disease-associated variants falling outside protein-coding regions of the genome, it is likely that variations in gene expression regulation will be essential to understanding disease aetiology. Information on RNA quantity and splicing isoforms is therefore likely to be crucial for understanding complex pathologies of deleterious genetic variation. The advent of next generation sequencing has allowed the development of an assortment of technologies for interrogating aspects of the genome, one of which is high-throughput RNA sequencing (RNA-Seq). This technology allows rapid, relatively cheap, and accurate quantification of transcripts at a genome-wide scale. By providing a greater number of advantages and fewer caveats than alternative methods of transcriptome quantification, RNA-Seq is a disruptive technology that is likely to supersede most others. Throughout this thesis, I have sought to demonstrate how these advantages assist in revealing significant and novel developmental, noncoding, coding, and alternative isoform information of relevance to disorders and diseases. I take advantage of methods that utilize the truly genome-wide coverage of RNA-Seq, that quantify large numbers of transcripts, and that interrogate novel splicing events. More specifically, I present (i) the identification of novel biomarkers of the various placode-derived vertebrate cranial nerves, (ii) differential gene networks which highlight the genetics of autism intellectual disability co-morbidity, and (iii) differential gene expression underlying a form of severe influenza susceptibility. 
In addition to these studies, this thesis presents an R package for RNA-Seq time-series experiments, including functionality for efficient model-based clustering, and the integration of gene ontology information for cluster number selection and for subsequent profiling. Overall, this thesis demonstrates how RNA-Seq is a powerful tool for understanding disease aetiology.
10

Pyon, Yoon Soo. "Variant Detection Using Next Generation Sequencing Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1347053645.

11

Sala, Claudia. "Ecological modelling for next generation sequencing data." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/6279/.

Abstract:
Next generation sequencing techniques are a powerful tool for a range of applications, especially now that their costs have begun to fall and the quality of their data to improve. One application of sequencing is metagenomics, that is, the analysis of the microorganisms within a given environment, such as the gut. In this field, sequencing has made it possible to sample bacterial species that could not be accessed with traditional culture techniques. The study of gut bacterial populations is very important because these populations are altered as an effect, but also as a cause, of numerous diseases, such as metabolic diseases (obesity, type 2 diabetes, etc.). In this work we started from next generation sequencing data on the gut microbiota of 5 animals (16S rRNA sequencing) [Jeraldo et al.]. We applied optimized algorithms (UCLUST) to cluster the generated sequences into OTUs (Operational Taxonomic Units), which correspond to clusters of bacterial species at a given taxonomic level. We then applied the master-equation ecological theory developed by [Volkov et al.] to describe the relative species abundance (RSA) distribution for our samples. RSA is a well-validated tool for studying the biodiversity of ecological systems, and it shows a transition from a log-series to a lognormal shape when moving from small, isolated local communities to larger metacommunities made up of several local communities that can interact in some way. We showed that the OTUs of gut bacterial populations constitute an ecological system that follows these same rules when obtained using different similarity thresholds in the clustering procedure.
We therefore expect that this result can be exploited to understand the dynamics of bacterial populations, and hence how they vary in the presence of particular diseases.
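As an illustration of the log-series shape mentioned in this abstract, the sketch below evaluates Fisher's log-series, in which the expected number of species represented by n individuals is alpha * x**n / n. It is not the master-equation model of Volkov et al. used in the thesis; the function name and the parameter values are assumptions chosen only for the example.

```python
from typing import List


def logseries_expected_species(alpha: float, x: float, n: int) -> float:
    """Expected number of species with exactly n individuals under
    Fisher's log-series: alpha * x**n / n, with 0 < x < 1."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if n < 1:
        raise ValueError("n must be a positive integer")
    return alpha * x ** n / n


def logseries_curve(alpha: float, x: float, max_n: int) -> List[float]:
    """Expected species counts for abundances 1..max_n."""
    return [logseries_expected_species(alpha, x, n) for n in range(1, max_n + 1)]


if __name__ == "__main__":
    # Hypothetical parameters, not fitted to any data set.
    for n, s in enumerate(logseries_curve(alpha=10.0, x=0.5, max_n=5), start=1):
        print(n, s)
```

The curve decreases monotonically with n, so rare taxa dominate the species count; a lognormal distribution, by contrast, has an interior mode.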
12

Royall, Ariel. "Next-generation Sequencing Methods for Complex Communities." Thesis, University of Oregon, 2017. http://hdl.handle.net/1794/22682.

Abstract:
Advances in sequencing technology have opened up the possibility of investigating complex communities, but deviations from homogeneity in a sample create challenges in generating and analyzing sequence data. There are two kinds of heterogeneous populations that are addressed in this dissertation: low-frequency sequence variants in a group of largely homogeneous cells and rare members in complex biological communities. It is important to be able to fully characterize the heterogeneity of a sample, as rare genetic variants may provide fuel for selection and rare members of a complex community can play critical roles. Thus, heterogeneity can have important biological roles in everything from ecological community structure to human disease development and progression. In order to assess low-frequency mutations, Paired-End Low Error Sequencing (PELE-Seq) was used. With this method, mutations occurring at frequencies as low as 1 in 10,000 were identified, including some with transcriptional consequences. To investigate rare members of a larger community, an enrichment method was developed to sequence transcripts from host-associated bacteria. Rather than having to sequence the abundant zebrafish host RNA, the enrichment protocol allowed even very minor members of the community to be efficiently sequenced, enabling a first look at the gene expression changes during colonization. This dissertation includes work from previously published co-authored material.
13

Randel, Melissa. "New Technology Development for Next-Generation Sequencing." Thesis, University of Oregon, 2017. http://hdl.handle.net/1794/22704.

Abstract:
Next-Generation Sequencing (NGS) technologies have been evolving at an unparalleled pace. The ability to generate millions of base pairs of data in a short time and at lower cost than previously has led to a dramatic expansion of technologies within the field. This dissertation discusses the development and validation of new methods for assessing genomic variation, dynamic changes in gene expression, high-accuracy sequencing, and analysis of recombination events. By reducing the cost of analyzing many samples for genetic divergence by genotyping the same region of the genome in multiple samples, researchers can pursue investigations on a larger scale. Next-RAD (Nextera fragmentation with Restriction-Associated Digestion) allows analysis of a uniform subset of loci between organisms for comparison of populations by genetic differences with reduced burdens of cost and data analysis. This method was applied to the Anopheles darlingi mosquito to identify three distinct species that were thought to be a uniform population. The lowering cost of large-scale sequencing investigations allows for massively parallel analysis of genomic function in a single assay. Regulation of gene expression in response to stress is a complex process which can only be understood by analyzing many pathways in tandem. A novel method is described which quantifies on a genome-wide scale the expression of millions of randomer tags driven by associated transcriptional enhancers. This method provides novel data in the form of high-resolution analysis of gene regulation. Aside from generating novel data types, another force behind development of new technologies is to improve data quality. One limitation of NGS is the inherent error rate. PELE-Seq (Paired End Low Error Sequencing) was developed to address this problem, by employing completely overlapping paired-end reads as well as a dual barcoding strategy to eliminate incorrect sequences resulting from final library amplification. 
This new tool improves data quality dramatically. Finally, the rapid expansion of tools necessitates the identification of new applications for these technologies. To this end, 10x Genomics Linked-Read sequencing was employed to identify recombination events in multiple species. The haplotype-resolved nature of the data generated from such assays has many promising applications.
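The overlapping-read idea described for PELE-Seq can be sketched as follows: when the two reads of a pair cover exactly the same fragment, a base is trusted only where both reads agree. This is a minimal sketch of that one idea, assuming equal-length, fully overlapping reads; it does not reproduce the dual barcoding step or the actual PELE-Seq software, and all function names are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_consensus(read1: str, read2: str, mask: str = "N") -> str:
    """Build a consensus from a completely overlapping read pair.

    read2 is assumed to be sequenced from the opposite strand, so it is
    reverse-complemented before comparison. Positions where the two
    reads disagree are replaced by `mask`.
    """
    mate = reverse_complement(read2)
    if len(mate) != len(read1):
        raise ValueError("reads must overlap completely (equal length)")
    return "".join(a if a == b else mask for a, b in zip(read1, mate))


if __name__ == "__main__":
    print(overlap_consensus("ACGTAC", "GTACGT"))  # reads agree everywhere
    print(overlap_consensus("ACGTAC", "GTACGA"))  # one disagreement is masked
```

Masking disagreements trades coverage for accuracy: a sequencing error would have to occur at the same position in both reads to survive.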
14

BERETTA, STEFANO. "Algorithms for next generation sequencing data analysis." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2013. http://hdl.handle.net/10281/42355.

Abstract:
Two of the main bioinformatics fields that have been influenced by the introduction of Next-Generation Sequencing (NGS) techniques are transcriptomics and metagenomics. The adoption of these new methods to sequence DNA/RNA molecules has drastically changed both the kind and the amount of data produced. As a result, the algorithms and tools developed for traditional data cannot be applied to NGS data. For this reason, in this thesis we address two central problems in these two fields: transcriptomics and metagenomics. The first concerns the characterization of Alternative Splicing (AS) events starting from NGS sequences coming from transcripts (called RNA-Seq reads). To this end, we have modeled the structure of a gene, with respect to the AS variations occurring in it, using a graph representation (called a splicing graph). More specifically, we have identified the conditions for the correct reconstruction of the splicing graph from RNA-Seq data, and we have developed an algorithm for its construction. Moreover, our method is able to reconstruct the splicing graph correctly even when the input RNA-Seq reads do not satisfy the identified conditions. Finally, we have performed an experimental analysis of our procedure in order to validate the results obtained. The second problem we address in this thesis is the assignment of NGS reads, coming from a metagenomic sample, to a reference taxonomic tree, in order to assess the composition of the sample and classify the unknown micro-organisms in it. This is done by aligning the reads to the taxonomic tree and then choosing (when there is more than one valid match) the node that best represents the read. This choice is based on the calculation of a Penalty Score (PS) function for all the nodes descending from the lowest common ancestor of the valid matches in the tree.
We have developed an optimal algorithm for the computation of the PS function, based on the so-called skeleton tree, which improves the performance of the taxonomic assignment procedure. We have also implemented the method using more efficient data structures than those used in the previous version of the procedure. Finally, we have made it possible to switch among different taxonomies by developing a method to map trees and translate the input alignments.
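The taxonomic assignment step starts from the lowest common ancestor (LCA) of a read's valid matches. The sketch below shows only that LCA computation on a toy taxonomy stored as a child-to-parent map. The Penalty Score function and the skeleton tree are not specified in the abstract and are not reproduced here; the tree, the names, and the functions are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Toy taxonomy as a child -> parent map (hypothetical, for illustration only).
TOY_PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Bacillus": "Firmicutes",
}


def path_from_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the list of nodes from the root down to `node`, inclusive."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path[::-1]


def lowest_common_ancestor(
    nodes: Iterable[str], parent: Dict[str, Optional[str]]
) -> str:
    """Return the deepest node that is an ancestor of (or equal to) every input node."""
    paths = [path_from_root(n, parent) for n in nodes]
    if not paths:
        raise ValueError("at least one node is required")
    lca: Optional[str] = None
    # Walk the root-first paths in lockstep until they diverge.
    for level in zip(*paths):
        if all(name == level[0] for name in level):
            lca = level[0]
        else:
            break
    if lca is None:
        raise ValueError("nodes do not share a common root")
    return lca


if __name__ == "__main__":
    print(lowest_common_ancestor(["Escherichia", "Salmonella"], TOY_PARENT))
    print(lowest_common_ancestor(["Escherichia", "Bacillus"], TOY_PARENT))
```

A read matching only Escherichia and Salmonella would be placed no higher than Proteobacteria; a scoring function such as the PS would then pick among the nodes in that subtree.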
15

Suren, Haktan. "Sequence capture as a tool to understand the genomic basis for adaptation in angiosperm and gymnosperm trees." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/86383.

Abstract:
Forest trees are a unique group of organisms of both ecological and economic importance. Owing to their random mating system and widespread geographical distribution, they harbor abundant genetic variation both within and among populations. Despite their importance, forest trees have been underrepresented in research, mainly because of their large and complex genomes and scarce funding. However, recent climate change and associated problems such as insect outbreaks, diseases, and stress-related damage have urged scientists to focus more on trees. Furthermore, the advent of high-throughput sequencing technologies has allowed trees to be sequenced and used as reference genomes, which has deepened our understanding of the relationship between genotype and environment. Whole genome sequencing is still not possible for organisms with large genomes, including most tree species, and it is still not economically feasible for population genomic studies, which require sequencing hundreds of samples. To get around this problem, genomic reduction is required. Sequence capture is one of the genomic reduction techniques that enable the study of a subset of the DNA of interest. In this paper, our primary goal is to outline challenges, provide guidance about the utility of sequence capture in trees, and leverage such data in genome-wide association analyses to find the genetic variants that underlie complex, adaptive traits in spruce and pine, as well as poplar. Results of this research will help bridge the genomic information gap between trees and other organisms. Moreover, it will provide a better understanding of how genetic variation governs phenotype in trees, which will facilitate marker-assisted selection for improved traits as well as guide forest management strategies for reforestation to mitigate the effects of climate change.
Ph. D.
16

Tork, Bassam A. "VIRAL QUASISPECIES RECONSTRUCTION USING NEXT GENERATION SEQUENCING READS." Digital Archive @ GSU, 2013. http://digitalarchive.gsu.edu/cs_diss/77.

Abstract:
The genomic diversity of viral quasispecies is a subject of great interest, especially for chronic infections. Characterization of viral diversity can be addressed by high-throughput sequencing technology (454 Life Sciences, Illumina, SOLiD, Ion Torrent, etc.). Standard assembly software was originally designed for single genome assembly and cannot be used to assemble and estimate the frequency of closely related quasispecies sequences. This work focuses on parsimonious and maximum likelihood models for assembling viral quasispecies and estimating their frequencies from 454 sequencing data. Our methods have been applied to several RNA viruses (HCV, IBV) as well as DNA viruses (HBV), genotyped using 454 Life Sciences amplicon and shotgun methods.
17

Busby, Michele Anne. "Measuring Gene Expression With Next Generation Sequencing Technology." Thesis, Boston College, 2012. http://hdl.handle.net/2345/3145.

Abstract:
Thesis advisor: Gabor Marth
While a PhD student in Dr. Gabor Marth's laboratory, I have had primary responsibility for two projects focused on using RNA-Seq to measure differential gene expression. In the first project we used RNA-Seq to identify differentially expressed genes in four yeast species and I analyzed the findings in terms of the evolution of gene expression. In this experiment, gene expression was measured using two biological replicates of each species of yeast. While we had several interesting biological findings, during the analysis we dealt with several statistical issues that were caused by the experiment's low number of replicates. The cost of sequencing has decreased rapidly since this experiment's design and many of these statistical issues can now practically be avoided by sequencing a greater number of samples. However, there is little guidance in the literature as to how to intelligently design an RNA-Seq experiment in terms of the number of replicates that are required and how deeply each replicate must be sequenced. My second project, therefore, was to develop Scotty, a web-based program that allows users to perform power analysis for RNA-Seq experiments. The yeast project resulted in a highly accessed first author publication in BMC Genomics in 2011. I have structured my dissertation as follows: The first chapter, entitled General Issues in RNA-Seq, is intended to synthesize the themes and issues of RNA-Seq statistical analysis that were common to both papers. In this section, I have discussed the main findings from the two papers as they relate to analyzing RNA-Seq data. Like the Scotty application, this section is designed to be "used" by wet-lab biologists who have a limited background in statistics. While some background in statistics would be required to fully understand the following chapters, the essence of this background can be gained by reading this first chapter. 
The second and third chapters contain the two papers that resulted from the two RNA-Seq projects. Each chapter contains both the original manuscript and the original supplementary methods and data section. Finally, I include brief summaries of my contributions to the two papers on which I was a middle author. The first was a functional analysis of the genomic regions affected by mobile element insertions as a part of Chip Stewart's paper with the 1000 Genome Consortium. This paper was published in PLoS Genetics. The second was a cluster analysis of microarray gene expression in Toxoplasma gondii, which was included as part of Alexander Lorestani et al.'s paper, Targeted proteomic dissection of Toxoplasma cytoskeleton sub-compartments using MORN1. This paper is currently under review. The yeast project was a collaborative effort between Jesse Gray, Michael Springer, and Allen Costa at Harvard Medical School, Jeffery Chuang here at Boston College, and members of the Marth lab. Jesse Gray conceived of the project. While I wrote the draft for the manuscript, many people, particularly Gabor Marth, provided substantial guidance on the actual text. I conceived of and implemented Scotty and wrote its manuscript with only editorial assistance from my co-authors. I produced all figures for the two manuscripts. Chip Stewart provided extensive guidance and mentorship to me on all aspects of my statistical analyses for both projects.
Thesis (PhD) — Boston College, 2012
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology
18

Laver, Thomas William. "Evaluating metagenomic quantifications from next-generation sequencing data." Thesis, University of Exeter, 2014. http://hdl.handle.net/10871/17439.

Abstract:
Molecular profiling is exploiting the unprecedented power of next generation DNA sequencing to illuminate the microbial diversity of the natural world. The composition of microbiomes has been implicated as an important factor in human health and the function of ecosystems. It is thus of great importance that measurements of microbiomes are accurate and reliable, and moreover it is essential that the accuracy and reliability of such measurements are well understood. This project sought to provide assessments of the accuracy and precision of measurements made by 16S rDNA amplicon sequencing and whole genome shotgun sequencing, as well as to investigate the impact of different experimental and bioinformatics choices on quantitative measurements. To address these aims, next generation sequencing data from a well quantified metagenomic control material was utilized. Good precision and accuracy were recorded for 16S primer pairs which were perfectly complementary to the target organisms. Where primers were not perfectly complementary to an organism, its abundance was underestimated. Whole genome shotgun sequencing demonstrated very high levels of precision, with a mean coefficient of variation of 2%, and showed good agreement with the 16S rDNA amplicon sequencing using primer pairs optimized specifically for the target species. Small changes in relative species abundance (less than three fold) should be treated with caution, as this thesis demonstrated that sequencing results for species can vary by this amount from digital polymerase chain reaction results. Issues with publicly available 16S rDNA sequence databases contribute to a lack of taxonomic resolution; taxa measured at low abundance are also likely to be artifacts of the analysis. In addition to the established sequencing platforms, this thesis also investigated the performance of a promising new experimental DNA sequencing platform developed by Oxford Nanopore Technologies (ONT).
The ONT MinION has an error rate of greater than 40% and, while it produces exceptionally long reads, it is not yet suitable for quantitative metagenomics. This thesis also demonstrated that the use of control materials in molecular profiling is important to verify findings and to understand the impact different experimental and bioinformatics choices have on measurements of the microbiome.
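Precision in this abstract is reported as a coefficient of variation (CV), that is, the standard deviation of replicate measurements divided by their mean. A minimal sketch of that calculation follows; the replicate values are invented for illustration and are not taken from the thesis, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean of the replicates."""
    if len(values) < 2:
        raise ValueError("at least two replicate measurements are required")
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m


if __name__ == "__main__":
    # Hypothetical replicate abundance estimates for one species.
    replicates = [98, 100, 102]
    print(f"CV = {coefficient_of_variation(replicates):.1%}")
```

Because the CV is scale-free, it allows precision to be compared across species measured at very different abundances.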
19

Ljungström, Viktor. "Exploring next-generation sequencing in chronic lymphocytic leukemia." Doctoral thesis, Uppsala universitet, Experimentell och klinisk onkologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-302026.

Abstract:
Next-generation sequencing (NGS) techniques have led to major breakthroughs in the characterization of the chronic lymphocytic leukemia (CLL) genome with discovery of recurrent mutations of potential prognostic and/or predictive relevance. However, before NGS can be introduced into clinical practice, the precision of the techniques needs to be studied in greater detail. Furthermore, much remains unknown about the genetic mechanisms leading to aggressive disease and resistance to treatment. Hence, in Paper I, the technical performance of a targeted deep sequencing panel including 9 genes was evaluated in 188 CLL patients. We were able to validate 143/155 (92%) selected mutations through Sanger sequencing, and 77/82 mutations were concordant in a second targeted sequencing run, indicating that the technique can be introduced in clinical practice. In Paper II we screened 18 NF-κB pathway genes in 315 CLL patients through targeted deep sequencing, which revealed a recurrent 4 base-pair deletion in the NFKBIE gene. Screening of NFKBIE in 377 additional cases identified the mutation in ~6% of all CLL patients. We demonstrate that the lesion leads to aberrant NF-κB signaling through impaired interaction with p65 and is associated with unfavorable clinical outcome. In Paper III we sought to delineate the genetic lesions that lead to relapse after fludarabine, cyclophosphamide, and rituximab treatment. Through whole-exome sequencing of pre-treatment and relapse samples from 41 cases we found evidence of frequent selection of subclones harboring driver mutations and subsequent clonal evolution following treatment. We also detected mutations in the ribosomal protein RPS15 in 8 cases (19.5%), and characterization of the mutations through functional assays points to impaired p53 regulation in cells with mutated RPS15. Paper IV aimed at characterizing 70 patients assigned to three major subsets (#1, #2, and #4) through whole-genome sequencing.
Besides recurrent exonic driver mutations, we report non-coding regions significantly enriched for mutations in subset #1 and #2 that may facilitate future molecular studies. Collectively, this thesis supports the potential of targeted sequencing for mutational screening of CLL in clinical practice, provides novel insight into the pathobiology of aggressive CLL, and demonstrates the clinical outcome and cellular effects of NFKBIE and RPS15 mutations.
20

Cui, Hongzhu. "In Silico Edgetic Profiling and Network Analysis of Human Genetic Variants, with an Application to Disease Module Detection." Digital WPI, 2020. https://digitalcommons.wpi.edu/etd-dissertations/596.

Abstract:
In the past several decades, Next Generation Sequencing (NGS) methods have produced large amounts of genomic data at an exponentially increasing rate. They have also enabled tremendous advancements in the quest to understand the molecular mechanisms underlying complex human traits. Along with the development of NGS technology, many genetic variation and genotype–phenotype databases and functional annotation tools have been developed to help scientists better understand the intricacy of the data. Together, these findings bring us one step closer to a mechanistic understanding of complex phenotypes. However, it has rarely been possible to translate such a massive amount of information on mutations and their associations with phenotypes into biological or therapeutic insights, and the mechanisms underlying genotype-phenotype relationships remain only partially explained. Meanwhile, increasing evidence shows that biological networks are essential, albeit not sufficient, for a better understanding of these mechanisms. Among them, protein-protein interaction (PPI) network studies have attracted perhaps the most attention. The overarching goal of this dissertation is to (i) perform a systematic study of the role of pathogenic human genetic variants in the interactome; (ii) examine how common population-specific SNVs affect the PPI network and how they contribute to population phenotypic variance and disease susceptibility; and (iii) develop a novel framework that incorporates the functional effect of mutations for disease module detection. In this dissertation, we first present a systematic multi-level characterization of human mutations associated with genetic disorders by determining their individual and combined interaction-rewiring effects on the human interactome. Our in-silico analysis highlights the intrinsic differences and important similarities between the pathogenic single nucleotide variants (SNVs) and frameshift mutations.
Functional profiling of SNVs indicates widespread disruption of protein-protein interactions and synergistic effects of SNVs. The coverage of our approach is several times greater than that of a recently published experimental study and has minimal overlap with it, while the distributions of determined edgotypes between the two sets of profiled mutations are remarkably similar. Case studies reveal the central role of interaction-disrupting mutations in type 2 diabetes mellitus and suggest the importance of studying mutations that abnormally strengthen protein interactions in cancer. Second, aided by our SNP-IN tool, we performed a systematic edgetic profiling of population-specific non-synonymous SNVs (nsSNVs) and interrogated their role in the human interactome. Our results demonstrate that a considerable number of normal nsSNVs can have a disruptive impact on the interactome. We also show that genes enriched with disruptive mutations are associated with diverse functions and have implications in various diseases. Further analysis indicates that distinct gene edgetic profiles among major populations can help explain the population phenotypic variance. Finally, network analysis reveals that phenotype-associated modules are enriched with disruptive mutations, and the difference in the accumulated damage in such modules may suggest population-specific disease susceptibility. Lastly, we propose and develop a computational framework, Discovering most IMpacted SUbnetworks in interactoMe (DIMSUM), which enables the integration of genome-wide association studies (GWAS) and functional effects of mutations into the protein–protein interaction (PPI) network to improve disease module detection. Specifically, our approach incorporates and propagates the functional impact of non-synonymous single nucleotide polymorphisms (nsSNPs) on PPIs to implicate the genes that are most likely influenced by the disruptive mutations, and to identify the module with the greatest functional impact.
Comparison against state-of-the-art seed-based module detection methods shows that our approach can yield modules that are biologically more relevant and have a stronger association with the studied disease. With the advancement of next-generation sequencing technology that drives precision medicine, there is an increasing demand for understanding the changes in molecular mechanisms caused by specific genetic variation. Current and future in-silico edgotyping tools present a cheap and fast solution for dealing with the rapidly growing datasets of discovered mutations. Our work shows the feasibility of a large-scale in-silico edgetic study and reveals insights into the orchestrated play of mutations inside a complex PPI network. We also expect our module detection method to become part of the common toolbox for disease module analysis, facilitating the discovery of new disease markers.
21

Nafisinia, Michael. "Gene Discovery for Genetic Disorders using Next Generation Sequencing and Functional Genomics." Thesis, The University of Sydney, 2017. http://hdl.handle.net/2123/16867.

Abstract:
The focus of this thesis was the identification of the genetic bases of Mendelian and mitochondrial respiratory chain disorders in a cohort of paediatric patients, to better understand their pathogenesis. Genetic disorders are caused by mutations in the mitochondrial or nuclear genomes and may be influenced to a lesser or greater degree by environmental factors. To date, nearly 3000 genes have been implicated in ~4,400 Mendelian phenotypes. Despite this, the genetic bases for almost 50% of all known Mendelian phenotypes remain to be definitively elucidated. Mitochondrial respiratory chain disorders are the most common group of inborn errors of metabolism and can be caused by mutations in either mitochondrial DNA or nuclear DNA. The genetic heterogeneity of these disorders makes diagnosis challenging, adding distress to families already dealing with the trauma of an extremely ill family member. Mutations in mitochondrial DNA or nuclear DNA genes can result in impaired function of the respiratory chain, causing a broad range of symptoms including neuropathy, cardiomyopathy, muscle weakness, fatigue, cognitive impairment, and visual and auditory impairment. Despite the advances in gene screening techniques, the genetic bases of many respiratory chain disorders remain unidentified. This study had two phases: identification of the likely disease-causing variants in paediatric patients with suspected Mendelian or mitochondrial inherited disorders due to mitochondrial or nuclear DNA mutations, and implementation of functional studies to confirm pathogenicity and gain insights into possible disease mechanisms of the identified variants. Nine patients with suspected Mendelian disorders and two patients with a suspected mitochondrial disorder were studied in this project.
Using whole exome sequencing (WES) in collaboration with other institutes or groups within Australia and overseas, we were able to efficiently identify the genetic basis of Mendelian and mitochondrial respiratory chain disorders in the majority of the paediatric patients studied in this project. In collaboration with bioinformaticians and clinician colleagues, we implemented sophisticated filtering pipelines, with candidate causative variants being narrowed down from the very expansive WES data. We then performed functional assays to determine the functional impact of the identified variants. These functional assays included immunoblotting and blue native polyacrylamide gel electrophoresis to measure protein expression and assembly. Further, in the case of the mitochondrial respiratory chain disorders, we measured the effect of the variant on the protein levels of mitochondrial respiratory chain complexes in patient fibroblast samples. We also measured respiratory chain complex enzyme activities using dipstick assays (for complex I and complex IV) or traditional spectrophotometric assays (for complexes I, II, III, and IV). In this PhD project, we have successfully identified four disease variants in NDUFV1 (OMIM: 161015), RARS (OMIM: 107820), GARS (OMIM: 600287), and PIGN (OMIM: 606097) and a novel variant NOX4 (OMIM: 605261) that may act as modifier in causing death in the proband. NDUFV1 encodes a 51 kDa subunit of the NADH: ubiquinone oxidoreductase complex I and was the cause of Leigh disease in one patient. Both RARS and GARS are part of the aminoacyl-tRNA synthetase family, encoding arginyl-tRNA synthetase and glycyl-tRNA synthetase proteins respectively, with mutations in the former causing a hypomyelination disorder very similar to Pelizaeus–Merzbacher disease in three patients, while the latter caused a mitochondrial respiratory chain disorder in one patient. 
PIGN encodes a protein that is involved in glycosylphosphatidylinositol (GPI)-anchor biosynthesis and was the cause of a neurological disorder in two patients. The NOX4 gene encodes the catalytic subunit of the NADPH oxidase complex that catalyses the reduction of molecular oxygen, mainly to hydrogen peroxide. It is possible that the NOX4 genotype we have identified may act as a modifier for the, as yet, unidentified primary genetic cause in our patient. The findings of this thesis highlight the importance of a multidisciplinary and multipronged approach to the identification of causative variants in patients with suspected Mendelian or mitochondrial respiratory chain disorders. These approaches include careful delineation of the clinical features, biochemical testing, histological analysis, and genetic investigations including WES, coupled to laboratory-based functional studies. Identification of the underlying genetic causes and understanding the resulting pathogenesis of these disorders may point to existing therapies or the development of novel therapies, and provide critical information to genetic counsellors allowing them to more effectively advise the parents of affected individuals for future family planning.
22

DAL, MOLIN MATTEO. "Identification and validation of DNA sequence variants in cancer predisposition genes by next generation sequencing approaches." Doctoral thesis, Università degli studi di Pavia, 2022. https://hdl.handle.net/11571/1468217.

23

DAL, MOLIN MATTEO. "Identification and validation of DNA sequence variants in cancer predisposition genes by next generation sequencing approaches." Doctoral thesis, Università degli studi di Pavia, 2022. https://hdl.handle.net/11571/1468215.

24

Kislyuk, Andrey O. "Algorithm development for next generation sequencing-based metagenome analysis." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/42779.

Abstract:
We present research on the design, development and application of algorithms for DNA sequence analysis, with a focus on environmental DNA (metagenomes). We present an overview and primer on algorithm development for bioinformatics of metagenomes; work on frameshift detection in DNA sequencing data; work on a computational pipeline for the assembly, feature prediction, annotation and analysis of bacterial genomes; work on unsupervised phylogenetic clustering of metagenomic fragments using Markov Chain Monte Carlo methods; and work on estimation of bacterial genome plasticity and diversity, potential improvements to the measures of core and pan-genomes.
25

Dupuis, Sandoval Fabien. "Exploring optimal snoRNA profiling using Next Generation Sequencing methods." Mémoire, Université de Sherbrooke, 2018. http://hdl.handle.net/11143/11931.

Abstract:
Recent advances in Next-Generation Sequencing protocols have opened a variety of ways to generate data. However, each newly developed methodology is most suited to represent a certain phenomenon or molecule. The object of this analysis is to identify the most appropriate way to generate and process data to study the snoRNAs, or small nucleolar RNAs. Recently, snoRNAs have been revealed as taking part in a variety of unexpected alternative functions such as splicing, resistance to oxidative shock and chromatin unwinding. Finding a method to generate and treat a large quantity of data containing snoRNAs and their potential interactors could highlight some of their unexplored roles within the cell. To tackle the problem, a new protocol was put forward. This new pipeline relies on a reverse transcriptase isolated from a bacterial group II intron, which gives a better representation of structured small RNAs such as tRNAs and snoRNAs. Indeed, when compared to data created by using the standard small RNA preparation protocol, the sequencing data generated through the group II intron reverse transcriptase gives a much fairer representation. These improvements are also present in the bioinformatics pipeline. The workflow was changed to facilitate the detection of ncRNAs. These modifications rescue millions of reads, further increasing the power of the analysis. Ultimately, such corrections increase the predictive power of sequencing data.
Recent advances in next-generation sequencing have opened up a wide range of ways to generate data. However, each newly developed method is often suited to characterizing only a single type of phenomenon or molecule. The objective of this analysis is to identify the most appropriate way to generate and process data to study small nucleolar RNAs (snoRNAs). Recently, these have been revealed as players in a variety of alternative functions such as alternative splicing, resistance to oxidative shock and chromatin state. It is therefore imperative to find a method that can process a large quantity of data containing snoRNAs and their interactors in order to discover the still-unexplored roles of snoRNAs. To this end, a new protocol was developed. This new analysis suite relies on a reverse transcriptase isolated from a bacterial group II intron, which gives a better representation of structured small RNAs such as tRNAs and snoRNAs. Indeed, when data generated with the standard small RNA library preparation method are compared with data based on the bacterial reverse transcriptase, the latter gives a better representation of species counts. These advances are also present in the computational analysis method. The tool suite was modified to allow better detection of small non-coding RNAs. These modifications make it possible to recover millions of reads per dataset, which increases the predictive power of the analysis.
26

Forster, Michael [Verfasser]. "Translating Next-Generation-Sequencing into Precision Medicine / Michael Forster." Kiel : Universitätsbibliothek Kiel, 2019. http://d-nb.info/1182989748/34.

27

Wang, Yi, and 王毅. "Binning and annotation for metagenomic next-generation sequencing reads." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2014. http://hdl.handle.net/10722/208040.

Abstract:
The development of next-generation sequencing technology enables us to obtain a vast number of short reads from metagenomic samples. In metagenomic samples, the reads from different species are mixed together. So, metagenomic binning has been introduced to cluster reads from the same or closely related species and metagenomic annotation is introduced to predict the taxonomic information of each read. Both metagenomic binning and annotation are critical steps in downstream analysis. This thesis discusses the difficulties of these two computational problems and proposes two algorithmic methods, MetaCluster 5.0 and MetaAnnotator, as solutions. There are six major challenges in metagenomic binning: (1) the lack of reference genomes; (2) uneven abundance ratios; (3) short read lengths; (4) a large number of species; (5) the existence of species with extremely low abundance; and (6) recovering low-abundance species. To solve these problems, I propose a two-round binning method, MetaCluster 5.0. The improvement achieved by MetaCluster 5.0 is based on three major observations. First, the short q-mer (length-q substring of the sequence with q = 4, 5) frequency distributions of individual sufficiently long fragments sampled from the same genome are more similar than those sampled from different genomes. Second, sufficiently long w-mers (length-w substring of the sequence with w ≈ 30) are usually unique in each individual genome. Third, the k-mer (length-k substring of the sequence with k ≈ 16) frequencies from reads of a species are usually linearly proportional to the species' abundance.
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) inability to annotate many reads; (2) less precise annotation for reads and more incorrect annotation for contigs; (3) inability to deal well with novel clades that have limited reference genomes; (4) performance affected by variable genome sequence similarities between different clades; and (5) high time complexity. In this thesis, a novel tool, MetaAnnotator, is proposed to tackle these problems. There are four major contributions of MetaAnnotator. Firstly, instead of annotating reads/contigs independently, a cluster of reads/contigs is annotated as a whole. Secondly, multiple reference databases are integrated. Thirdly, for each individual clade, quadratic discriminant analysis is applied to capture the similarities between reference sequences in the clade. Fourthly, instead of using alignment tools, MetaAnnotator performs annotation using k-mer exact matching, which is more efficient. Experiments on both simulated datasets and real datasets show that MetaCluster 5.0 and MetaAnnotator outperform existing tools with higher accuracy as well as lower time and space cost.
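The first observation above (fragments from the same genome have similar short q-mer frequency distributions) can be illustrated with a minimal Python sketch. This is not MetaCluster 5.0 itself: the function names, the Euclidean distance, and the toy fragments are assumptions chosen only for illustration.

```python
from collections import Counter
from itertools import product
import math


def qmer_profile(seq, q=4):
    """Return the normalized frequency of every possible q-mer over ACGT."""
    seq = seq.upper()
    counts = Counter(seq[i:i + q] for i in range(len(seq) - q + 1))
    qmers = ["".join(p) for p in product("ACGT", repeat=q)]
    total = sum(counts[m] for m in qmers)
    if total == 0:
        return [0.0] * len(qmers)
    return [counts[m] / total for m in qmers]


def euclidean(a, b):
    """Euclidean distance between two profiles (an assumed, simple metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy fragments: a and b mimic the same "genome", c a different one.
frag_a = "ACGT" * 50
frag_b = "ACGT" * 40
frag_c = "AATT" * 50

same = euclidean(qmer_profile(frag_a), qmer_profile(frag_b))
different = euclidean(qmer_profile(frag_a), qmer_profile(frag_c))
print(same < different)
```

A binning method built on this idea would cluster fragments whose profiles lie close together; the choice of distance and clustering algorithm is left open here.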
Computer Science
Doctoral
Doctor of Philosophy
28

Bowen, Margot Elizabeth. "Applying Next Generation Sequencing to Skeletal Development and Disease." Thesis, Harvard University, 2013. http://dissertations.umi.com/gsas.harvard:11233.

Abstract:
Next Generation Sequencing (NGS) technologies have dramatically increased the throughput and lowered the cost of DNA sequencing. In this thesis, I apply these technologies to unresolved questions in skeletal development and disease. Firstly, I use targeted re-sequencing of genomic DNA to identify the genetic cause of the cartilage tumor syndrome, metachondromatosis (MC). I show that the majority of MC patients carry heterozygous loss-of-function mutations in the PTPN11 gene, which encodes a phosphatase, SHP2, involved in many signaling pathways. Furthermore, I show that cartilage lesions in MC patients likely arise following somatic second-hit mutations in PTPN11. Secondly, I use RNA-seq to identify gene expression changes that occur following genetic inactivation of Ptpn11 in mouse chondrocyte cultures. I show that chondrocytes lacking Ptpn11 fail to properly undergo terminal differentiation and instead continue to express genes associated with earlier stages of chondrocyte maturation. I validate these findings in vivo by examining markers of specific chondrocyte maturation stages in the vertebral growth plates of mice following postnatal mosaic inactivation of Ptpn11. Together, my results provide insight into the molecular mechanisms underlying the initiation and growth of cartilage tumors. In the third component of my thesis, I develop a method to map and clone zebrafish mutations by performing whole genome sequencing on pooled DNA. I apply this method to zebrafish mutants identified in a mutagenesis screen for adult phenotypes, including skeletal phenotypes, and determine that a nonsense mutation in bmp1a underlies the craniofacial phenotype in the wdd mutant. 
In summary, I show that NGS technologies can be successfully utilized to firstly identify the genetic cause of a human skeletal disorder, secondly investigate the molecular mechanisms regulating the maturation of skeletal cells, and thirdly expedite the process of mapping and cloning zebrafish mutants with skeletal phenotypes. Altogether, my research provides insight into the pathways and processes regulating skeletal development and disease.
29

Brown, J. R. "Next generation sequencing to understand norovirus in immunocompromised children." Thesis, University College London (University of London), 2017. http://discovery.ucl.ac.uk/1558811/.

Abstract:
Norovirus is a leading cause of gastroenteritis worldwide, causing self-limited vomiting and diarrhoea in immunocompetent people and chronic infections with significant morbidity in immunocompromised patients. Data presented in this thesis uses deep sequencing to increase our understanding of norovirus in a hospital paediatric population with a large proportion of immunocompromised patients. Real-time PCR reveals that norovirus is the most prevalent gastrointestinal virus in this population, causing infection with a higher viral titre than other gastrointestinal viruses. Norovirus is most common in immunocompromised patients and is the virus most commonly associated with chronic infections, which occur primarily in immunocompromised patients. The performance of a novel method for deep sequencing norovirus full genomes is described; this overcomes the limitations of previously published methods and achieves full genomes with >12000-fold read depth regardless of genotype or viral titre. This method is applied to sequence the complete genomes of every new norovirus case at Great Ormond Street Hospital (GOSH) over a 19 month period. Full genomes reveal a broad range of circulating genotypes, more akin to genotypes circulating in the community than those typically seen in hospitals. Phylogenetic analysis shows that the majority (69%) of cases are not acquired from another patient. This suggests multiple introductions of different norovirus strains, with limited nosocomial transmission. Full genome sequencing of longitudinally collected samples shows that chronic norovirus infections may involve super- or re-infection with a different genotype, although this does not affect the duration of infection. 
Deep sequencing is used to investigate changes in the norovirus intra-host mutation frequency in chronically infected immunosuppressed patients who were and were not treated with oral ribavirin, revealing a possible role for ribavirin in the treatment of chronic norovirus infections. However, interpretation of in vivo data is confounded by fluctuating mutation frequencies observed over time in all patients.
30

Wasylenko, Theresa Anne. "Understanding Huntington's Disease pathogenesis using next generation sequencing analyses." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/103260.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Biology, February 2016.
Cataloged from PDF version of thesis. "February 2015."
Includes bibliographical references (pages 215-240).
Huntington's disease is one of nine expanded (CAG) repeat disorders. The expansion in Huntington's disease lies in the first exon of the huntingtin (HTT) gene and is pathogenic when the number of CAG repeats is ≥ 40. Individuals with Huntington's disease develop motor, cognitive, and psychiatric symptoms in adulthood. These symptoms progress for approximately 15 years, at which time they become fatal. The clinical manifestation of HD largely results from the extreme degeneration of neurons in the striatum and cortex. The HTT gene encodes the huntingtin (HTT) protein. Over the years, researchers have developed a rich understanding of the consequences of loss of wildtype HTT function, gain of toxic mutant HTT function, and mutant HTT RNA toxicity. However, the mechanisms through which pathology develops are still largely ambiguous. Given the widespread involvement of HTT in cellular processes, next generation DNA sequencing technologies offer a rich opportunity to explore genome-wide effects of the HD mutation and may help answer mechanistic questions. The application of many next generation DNA sequencing methods is a new luxury for researchers. DNA sequencing methods have undergone a rapid technical evolution which has accelerated the financial feasibility of applying DNA sequencing-based methods on a routine basis. In this thesis, two high throughput analysis techniques, RNA-Seq and ChIP-Seq, were applied to Huntington's disease models to better understand disease mechanisms, and a third high throughput analysis technique, Ribo-Seq, was optimized for future HD studies. RNA-Seq on Huntington's disease model mice and their wildtype littermates demonstrated extensive and progressive dysregulation of the transcriptome in HD striatum and cortex, with most of the affected genes having a lower steady state expression in mutant tissues.
ChIP-Seq with an antibody against trimethylated histone H3 lysine 4 (H3K4me3) demonstrated both a general reduction of H3K4me3 levels and a unique histone profile at the promoters of HD downregulated genes. Analysis of RNA-Seq results for splicing changes showed that mutant HTT itself is mis-spliced. This mis-splicing product is translated into a small, pathogenic HTT fragment, which may have considerable implications for HD therapeutic design. In addition to CNS degeneration, severe muscle dysfunction is an early clinical observation in HD and many CAG repeat expansion disorders. Proper muscle form and function is dependent on an extensive alternative splicing program. Thus RNA-Seq data on muscle tissue from mouse models of several CAG expansion disorders were examined for genome-wide splicing alterations. Widespread mis-splicing was detected in the muscle of both Spinocerebellar ataxia 7 and Huntington's disease mouse models, and minor splicing dysregulation was detected in Spinal-bulbar muscular atrophy. Lastly, methods were developed to examine translational control and mRNA localization in the brain of Huntington's disease mice. Concurrent Ribo-Seq and RNA-Seq in diseased and wildtype animals would show whether translational control is altered. The Ribo-Seq protocol designed in cell culture was optimized for use on brain tissue and is ready for application in HD mouse models. The localization of mRNA transcripts to neuronal projections can be studied by combining fractionation experiments with RNA-Seq. A method to prepare high quality RNA from isolated neuronal projections was developed and is now applicable to RNA-Seq studies.
by Theresa Anne Wasylenko.
Ph. D.
31

Farrell, Andrew R. "Expanding the horizons of next generation sequencing with RUFUS." Thesis, Boston College, 2014. http://hdl.handle.net/2345/bc-ir:104176.

Full text
Abstract:
Thesis advisor: Gabor T. Marth
To help improve the analysis of forward genetic screens, we have developed an efficient and automated pipeline for mutational profiling using our reference guided tools including MOSAIK and FREEBAYES. Studies using next generation sequencing technologies currently employ either reference guided alignment or de novo assembly to analyze the massive amount of short read data produced by second generation sequencing technologies; the far more common approach being reference guided alignment due to the massive computational and sequencing costs associated with de novo assembly. The success of reference guided alignment is dependent on three factors; the accuracy of the reference, the ability of the mapper to correctly place a read, and the degree to which a variant allele differs from the reference. Reference assemblies are not perfect and none are entirely complete. Moreover, read mappers can only map reads in genomic locations that are unique enough to confidently place reads; paralogous sections, such as related gene families, cannot be characterized and are often ignored. Further, variant alleles that drastically alter the subject's DNA, such as insertions or deletions (INDELs), will not map to the reference and are either entirely missed or require further downstream analysis to characterize. Most importantly, reference guided methods are restricted to organisms for which such reference genomes have been assembled. The current alternative, de novo assembly of a genome, is prohibitively expensive for most labs requiring deep read coverage from numerous different library preparations as well as massive computing power. To address the shortcomings of current methods, while eliminating the costs intrinsic to de novo sequence assembly, we developed RUFUS, a novel, completely reference-independent variant discovery tool. RUFUS directly compares raw sequence data from two or more samples and identifies groups of reads unique to one or the other sample. 
RUFUS has at least the same variant detection sensitivity as mapping methods, with greatly increased specificity for SNPs and INDEL variation events. RUFUS is also capable of extremely sensitive copy number detection, without any restriction on event length. By modeling the underlying k-mer distribution, RUFUS produces a specific copy number spectrum for each individual sample. Applying a Bayesian detection method to detect changes in k-mer content between two samples, RUFUS produces copy number calls that are as sensitive as traditional copy number detection methods, with far fewer false positives. Our data suggest that RUFUS' reference-free approach to variant discovery is able to substantially improve upon existing variant detection methods: reducing reference biases, reducing false positive variants, and detecting copy number variants with excellent sensitivity and specificity.
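The core idea described above, comparing raw reads from two samples and keeping reads that carry sequence seen in only one of them, can be sketched with plain k-mer sets. The following Python sketch is not RUFUS: the function names, the tiny k, the `min_count` filter, and the toy reads are all assumptions for illustration, and reverse complements and sequencing errors are ignored.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_specific_kmers(case_reads, control_reads, k=5, min_count=2):
    """k-mers seen at least min_count times in the case sample, never in the control."""
    case = kmer_counts(case_reads, k)
    control = kmer_counts(control_reads, k)
    return {m for m, n in case.items() if n >= min_count and m not in control}


def reads_with_kmers(reads, kmers, k):
    """Reads that contain at least one of the given k-mers."""
    return [r for r in reads
            if any(r[i:i + k] in kmers for i in range(len(r) - k + 1))]


# Toy data: the case sample carries a single-base change (A -> T) at position 5.
control_reads = ["ACGTACGTAC", "CGTACGTACG"]
case_reads = ["ACGTTCGTAC", "ACGTTCGTAC", "ACGTACGTAC"]

unique = sample_specific_kmers(case_reads, control_reads, k=5)
candidates = reads_with_kmers(case_reads, unique, k=5)
print(sorted(unique))
print(candidates)
```

The `min_count` filter stands in for the more careful statistical modelling of k-mer distributions that the abstract describes; in real data it would mainly serve to discard k-mers created by sequencing errors.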
Thesis (PhD) — Boston College, 2014
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology
32

Innocenti, Nicolas. "Data Analysis and Next Generation Sequencing : Applications in Microbiology." Doctoral thesis, KTH, Beräkningsbiologi, CB, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-173219.

Full text
Abstract:
Next Generation Sequencing (NGS) is a new technology that has revolutionized the way we study living organisms. Where previously only a few genes could be studied at a time through targeted direct probing, NGS offers the possibility to perform measurements for a whole genome at once. The drawback is that the amount of data generated in the process is large and extracting useful information from it requires new methods to process and analyze it. The main contribution of this thesis is the development of a novel experimental method coined tagRNA-seq, combining 5’tagRACE, a previously developed technique, with RNA-sequencing technology. Briefly, tagRNA-seq makes it possible to identify the 5’ ends of RNAs in bacteria and directly probe for their type, primary or processed, by ligating short RNA sequences, the tags, to the beginnings of RNA molecules. We used the method to directly probe for transcription start and processing sites in two bacterial species, Escherichia coli and Enterococcus faecalis. It was also used to study polyadenylation in E. coli, where the ability to identify processed RNA molecules proved to be useful to separate direct and indirect regulatory effects of this mechanism. We also demonstrate how data from tagRNA-seq experiments can be used to increase confidence in the discovery of anti-sense transcripts in bacteria. Analyses of RNA-seq data obtained in the context of these experiments revealed subtle artifacts in the coverage signal towards gene ends, which we were able to explain and quantify based on Kolmogorov’s broken stick model. We also discovered evidence for circularization of a few RNA transcripts, both in our own data sets and in publicly available data. Designing the tags used in tagRNA-seq led us to the problem of words absent from a text. We focus on a particular subset of these, the minimal absent words (MAWs), and develop a theory providing a complete description of their size distribution in random text.
We also show that MAWs in genomes from viruses and living organisms almost always exhibit a behavior different from random texts in the tail of the distribution, and that MAWs from this tail are closely related to sequences present in the genome that preferentially appear in regions with important regulatory functions. Finally, and independently from tagRNA-seq, we propose a new approach to the problem of bacterial community reconstruction in metagenomics, based on techniques from compressed sensing. We provide a novel algorithm competing with state-of-the-art techniques in the field.
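For readers unfamiliar with minimal absent words, the standard definition is that a word is a MAW of a text if it does not occur in the text while both its longest proper prefix and its longest proper suffix do. The brute-force Python sketch below enumerates MAWs up to a small length; it is only an illustration of the definition, not the thesis's method, and the function name and `max_len` cap are assumptions.

```python
def minimal_absent_words(text, alphabet="ACGT", max_len=4):
    """Enumerate minimal absent words of `text` up to length max_len (brute force)."""
    # All substrings (factors) of the text up to max_len, plus the empty word.
    factors = {""}
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            factors.add(text[i:j])

    maws = set()
    for prefix in factors:  # the longest proper prefix must occur in the text
        if len(prefix) >= max_len:
            continue
        for letter in alphabet:
            word = prefix + letter
            # The word is absent, but its longest proper suffix occurs.
            if word not in factors and word[1:] in factors:
                maws.add(word)
    return sorted(maws, key=lambda w: (len(w), w))


print(minimal_absent_words("AAB", alphabet="AB"))
```

For the toy text "AAB" over the alphabet {A, B}, the words BA, BB and AAB's missing extension AAA are absent while all their shorter pieces occur, so they are the MAWs. Practical tools use suffix-based data structures rather than this quadratic enumeration.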

33

Shahbazi, Daniel. "Investigating streptococcal biodiversity in sepsis using next-generation sequencing." Thesis, Högskolan i Skövde, Institutionen för biovetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-16248.

Full text
Abstract:
Sepsis is one of the leading causes of fatalities in the intensive care unit, and also one of the biggest health problems worldwide. It is a disease caused primarily by bacterial infections but can also be caused by viral or fungal infections. Because sepsis is such a large health problem, coupled with longer stays in the intensive care unit, fast diagnosis and treatment are very important. Currently, culture is the leading diagnostic method for identification of bacteria, although other methods are being tested to improve identification time and decrease cost and workload. Next generation sequencing (NGS) has the capacity to output several million reads in a single experiment, making it very fast and relatively cheap compared to older sequencing methods such as Sanger sequencing. The ability to analyze genes and even whole genomes opens the possibility of identifying factors such as bacterial species, virulence genes and antibiotic resistance genes. The aim of this study was to find any possible correlations between 16 species of streptococci and clinical data in patients with suspected sepsis. Initial species identification was performed using MALDI-TOF before the samples were sequenced using NGS. Sequence files were then quality controlled and trimmed before being assembled. Following assembly, coverage was checked for all assembled genomes before the downstream analysis started. Tools for 16S rRNA species identification, multilocus sequence typing and antibiotic resistance gene detection were used, among others. The results were mixed: the overall quality of the sequence data was good, but the assembly and downstream analysis results were poorer. The most consistent species was S. pyogenes. No correlation between sepsis patients and relevant clinical data was found.
The mixed quality of results from assembly and downstream analysis was most likely attributable to difficulties in culturing and sequencing the streptococci. Finding ways to circumvent these problems would most likely aid in general sequencing of streptococcal species, and hopefully in clinical applications as well.
34

Yu, Xiaoqing. "Statistical Methods and Analyses for Next-generation Sequencing Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=case1403708200.

35

Khuder, Basil. "Human Genome and Transcriptome Analysis with Next-Generation Sequencing." University of Toledo Health Science Campus / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=mco1501886695490104.

36

Lee, Michael. "Next Generation Sequencing Strategies to Investigate Telomeres in Cancer." Thesis, The University of Sydney, 2019. https://hdl.handle.net/2123/21844.

Abstract:
Telomeres are regions of repetitive DNA at the ends of human chromosomes that function to maintain the integrity of the genome. Telomere attrition is associated with cellular ageing, whilst telomere maintenance is a prerequisite for replicative immortality in cancer. There are two telomere maintenance mechanisms (TMM) that cancer cells can utilise: the enzyme telomerase, or the Alternative Lengthening of Telomeres (ALT) pathway. These two mechanisms synthesise telomeres in very distinct ways, leading to differences in their telomere sequence composition and length. The molecular pathways involved with the selective activation of one TMM over the other remain unclear. In the last decade, whole genome sequencing (WGS) has proven to be an invaluable tool for the study of cancer, leading to the discovery of novel gene mutations that either drive the disease or confer an increased risk of developing it. The utility of this technique has led to the creation of vast cancer WGS data resources, in particular The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), that are available for cancer researchers worldwide to use. This provides an excellent resource from which we can better understand and associate genetic markers and telomere sequence content across cancers, as well as between tumours that utilise the telomere maintenance mechanisms telomerase and ALT. In order to utilise these available datasets, we require a WGS-based approach to determine the TMM status of a tumour, as experimental validation requires obtaining cellular material. We propose that differences exist in telomere sequence composition and length between ALT and telomerase cancers that can be used to determine the TMM status of a tumour from WGS data.
In this thesis, we first compared a range of WGS-based telomere content measurement tools against the lab-based technique q-PCR, in order to assess their accuracy in quantitating telomere content, whilst simultaneously enriching for variant telomeric sequences. We then applied the best of these tools to two experimentally validated tumour datasets, pancreatic neuroendocrine tumours and melanomas, in order to directly analyse and compare the telomere sequence content between tumours that utilise ALT and those that do not. Finally, we exploited the differences in telomere sequence content in order to develop a classifier capable of determining the ALT status of a tumour from WGS data, and applied it to WGS data from 821 TCGA tumours, to identify the molecular pathways associated with the activation of ALT. We were able to demonstrate that WGS-based telomere content measurement tools perform well, producing comparable results to q-PCR, with R² = 0.9516. We have developed a methodology for the accurate quantification of variant repeats within telomere sequences, identifying a number of differences in telomere sequence composition between ALT positive (+ve) and ALT negative (-ve) tumours. We have demonstrated the utility of this methodology to develop a WGS-based classifier capable of predicting the ALT status of a tumour with 91.6% accuracy. Analysis of pathway mutations that were under-represented in ALT tumours, across 1,075 tumour samples, revealed that the autophagy, cell cycle control of chromosomal replication, and transcriptional regulatory network in embryonic stem cells pathways were involved in the survival of ALT tumours. Overall, we have demonstrated the capability and utility of WGS to investigate telomere sequence content, shown how telomere sequence content can be used to stratify cancers by TMM, and applied this to cancer WGS datasets to elucidate the genetic changes that associate with each TMM.
This thesis provides a useful resource for future studies seeking to investigate the role of telomere sequence content in disease and overall health.
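As a rough illustration of what measuring telomere content and variant repeats from WGS reads can involve, the Python sketch below flags reads containing a run of the canonical human repeat TTAGGG (or its reverse complement) and tallies the hexamers found in them. It does not reproduce any of the tools or the classifier evaluated in the thesis; the function names, the `min_repeats` cutoff, the `T..GGG` pattern for variant repeats, and the toy reads are all assumptions.

```python
import re
from collections import Counter

CANONICAL = "TTAGGG"  # canonical human telomeric repeat (G-rich strand)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def telomere_summary(reads, min_repeats=4):
    """Return (fraction of telomeric reads, hexamer counts in those reads)."""
    g_run = CANONICAL * min_repeats
    c_run = revcomp(g_run)
    hexamers = Counter()
    telomeric = 0
    for read in reads:
        if c_run in read:
            read = revcomp(read)  # orient C-rich reads to the G-rich strand
        elif g_run not in read:
            continue  # not telomeric under this simple rule
        telomeric += 1
        # Tally canonical and single-variant repeats of the form T??GGG.
        hexamers.update(re.findall(r"T[ACGT]{2}GGG", read))
    fraction = telomeric / len(reads) if reads else 0.0
    return fraction, hexamers


reads = [
    "TTAGGG" * 5,
    "TTAGGG" * 4 + "TCAGGG",
    revcomp("TTAGGG" * 4 + "TGAGGG"),
    "ACGTACGTACGTACGTACGTACGTACGT",
]
fraction, hexamers = telomere_summary(reads)
print(fraction, dict(hexamers))
```

A real analysis would also normalise for sequencing depth and GC content and would work from aligned or unaligned BAM files rather than an in-memory list of strings.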
37

Helmuth, Johannes [Verfasser]. "Robust Normalization of Next Generation Sequencing Data / Johannes Helmuth." Berlin : Freie Universität Berlin, 2017. http://d-nb.info/1136319379/34.

38

Antanaviciute, Agne. "Novel algorithm development for 'next generation' sequencing data analysis." Thesis, University of Leeds, 2017. http://etheses.whiterose.ac.uk/20734/.

Abstract:
In recent years, the decreasing cost of ‘Next generation’ sequencing has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for the analysis of new types of data generated by these technologies have struggled to keep up. As a result, large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation. This thesis focuses on development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics – computational prioritisation/identification of disease gene variants and identification of RNA N6 -adenosine Methylation from sequencing data. The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology and its current applications and perspectives. Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Novel software developed for candidate gene prioritisation is described that utilises data mining of tissue-specific gene expression profiles (Chapter 3). The second part of chapter investigates an alternative approach to candidate variant prioritisation by leveraging functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies (Chapter 4). Chapter 5 discusses N6 AdenosineMethylation, a recently re-discovered posttranscriptional modification of RNA. 
The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi’s Sarcoma Herpes Virus. The chapter further discusses a putative novel N6-methyladenosine RNA-binding protein and its possible roles in the progression of viral infection.
39

Li, Zhiwei. "Characterising copy number polymorphisms using next generation sequencing data." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-386050.

Abstract:
We developed a pipeline to identify copy number polymorphisms (CNPs) in the Northern Swedish population using whole genome sequencing (WGS) data. Two different methodologies were applied to discover CNPs in more than 1,000 individuals. We also studied the association between the identified CNPs and the expression levels of 438 plasma proteins collected in the same population. The identified CNPs were summarized and filtered into a population copy number matrix for 1,021 individuals at 243,987 non-overlapping CNP loci. For the 872 individuals with both WGS and plasma protein biomarker data, we conducted linear regression analyses with age and sex as covariates. From the analyses, we detected 382 CNP loci, clustered in 30 collapsed copy number variable regions (CNVRs), that were significantly associated with the levels of 17 plasma protein biomarkers (p < 4.68 × 10^-10).
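The significance threshold quoted above is numerically consistent with a Bonferroni correction applied to every CNP locus and plasma protein pair. The abstract does not say how the threshold was chosen, so the correction method and the family-wise error rate of 0.05 in this sketch are assumptions; only the two counts come from the abstract.

```python
# Assumption: the threshold is a Bonferroni correction over all
# locus-protein pairs. The abstract does not state the method used.
ALPHA = 0.05            # assumed family-wise error rate
N_CNP_LOCI = 243_987    # non-overlapping CNP loci (from the abstract)
N_PROTEINS = 438        # plasma proteins (from the abstract)

# One regression test per (locus, protein) pair.
n_tests = N_CNP_LOCI * N_PROTEINS

# Bonferroni: divide the error rate by the number of tests.
bonferroni_threshold = ALPHA / n_tests

print(f"tests: {n_tests}")
print(f"threshold: {bonferroni_threshold:.2e}")
```

Under these assumptions the script reports 106,866,306 tests and a threshold of 4.68e-10, which agrees with the reported cut-off.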
40

Giollo, Manuel. "Computational Approaches to Address the Next-Generation Sequencing Era." Doctoral thesis, Università degli studi di Padova, 2015. http://hdl.handle.net/11577/3424280.

Abstract:
In this thesis, I propose new algorithms and models to address biological problems. Computer science in fact plays a key role in proteomics and genetics research due to the advent of big datasets. In the context of protein study, I developed new methods for protein function prediction based on information retrieval principles. By using heterogeneous sources of knowledge, like graph search and sequence similarity, I designed a tool called INGA that can be used to annotate entire genomes. It has been benchmarked during the Critical Assessment of Function Annotation challenge, and it proved to be one of the most effective approaches for function inference. To better characterize proteins from the structural point of view, I proposed a protein conformer detection strategy based on residue interaction network (RIN) data. RIN graphs were extended to deal with the time-dependent protein coordinate fluctuations, and were generated by clustering algorithms. An implementation called RING MD effectively highlighted the key amino acids known to be functionally relevant in ubiquitin. These amino acids are in fact very important for explaining the protein's three-dimensional dynamics. With the same rationale, RIN graphs were also used to predict the impact of mutations within a protein structure. By combining information about a mutant node in the network and its features, an artificial neural network was trained to estimate the Gibbs free energy change of a protein. Extreme changes in the internal energy might lead to protein unfolding, and possibly to disease. A reduction in a protein's flexibility may hamper its function as well. As an example, the extreme fluctuations observed in intrinsically disordered proteins (IDPs) are fundamental for their activities. To better understand IDPs, I contributed to the collection of the largest dataset of disordered regions.
The subsequent analysis showed the typical functions of these sequences and the biological processes in which they are involved. Due to the importance of their detection, a comprehensive assessment of disorder predictors was performed to identify the state-of-the-art methods and their limitations. In the context of genetics, I focused on phenotype prediction. During the Critical Assessment of Genome Interpretation (CAGI), I proposed new approaches for the analysis of exome data to prioritize the risk of Crohn's disease and abnormal cholesterol levels. These are often defined as complex diseases, since the mechanism behind their onset is still unknown. In my study, human samples with an enrichment of mutations in critical genes were predicted to have a high genetic risk. In addition to disease-associated genes, protein interaction networks were considered to better account for variant accumulation in biological pathways. This strategy was shown to be among the best approaches by the CAGI organizers. In the simpler case of Mendelian traits, with BOOGIE I designed a method for human blood group prediction based on exome data. It uses a specialized version of the nearest neighbor algorithm to match the gene variants in an unannotated exome with the ones available in a reference knowledge base. The most similar hit is used to transfer the blood group. With an accuracy above 90%, BOOGIE is a proof of concept that shows the potential applications of genetic prediction, and it can be easily extended to any Mendelian trait. To summarize, this thesis is a partial answer to the exponential growth of available sequences that need further experiments. By integrating heterogeneous information and designing new predictive models based on machine learning, I developed novel tools for biological data analysis and classification.
All implementations are freely available to the community and might be helpful in future investigations, such as drug design and disease studies.
In this thesis, new algorithms and models are proposed to address biological problems. Computer science plays a key role in proteomics and genetics research because of the need to manage large volumes of biological data. In the context of protein studies, I developed new methods for predicting protein function based on information retrieval principles. Using heterogeneous sources of knowledge, such as graph search and sequence similarity, I designed a tool called INGA that can be used to annotate entire genomes. It was evaluated impartially by the Critical Assessment of Function Annotation and proved to be one of the most effective approaches for function inference. To better characterize proteins from a structural point of view, I proposed a strategy for detecting protein conformations based on residue interaction networks (RINs). RIN networks were then extended to handle the temporal fluctuations of atomic coordinates. These graphs were finally generated automatically by clustering algorithms. An implementation called RING MD effectively highlighted the main amino acids known to be functionally relevant in ubiquitin. These amino acids are in fact very important for explaining the structural dynamics of the protein. Following the same logic, RIN graphs were also used to predict the impact of mutations within a protein structure. By combining information about the mutant node in a network and its features, an artificial neural network was trained to estimate the change in Gibbs free energy within a protein. Extreme changes in internal energy could lead to unfolding of the protein, and possibly to disease. On the other hand, a reduction in protein flexibility can also hamper its function.
For example, the extreme fluctuations observed in intrinsically disordered proteins (IDPs) are fundamental to their activities. To study IDPs, I contributed to the collection of the largest dataset of disordered regions ever assembled. The subsequent analysis showed the typical functions of these sequences and the biological processes in which they are involved. Given the importance of identifying them, a global assessment of disorder predictors was carried out to show which methods are most effective and what their limitations are. In the context of genetics, I concentrated on phenotype prediction. During the Critical Assessment of Genome Interpretation (CAGI), I proposed new approaches for the analysis of exome data designed to assess the risk of Crohn's disease and hypercholesterolemia. These are often defined as complex diseases, since the mechanism underlying their onset is still unknown. In my study, human samples with an enrichment of mutations in critical genes were predicted to be at high genetic risk. In addition to disease-associated genes, protein interaction networks were considered in order to assess the accumulation of variants in biological pathways. This strategy proved to be among the best according to the CAGI organizers. In the simpler case of Mendelian traits, with BOOGIE I designed a method for predicting human blood groups based on exome data. It uses a specialized version of the nearest neighbour algorithm to match the genetic variants in an unannotated exome with those available in a reference knowledge base. The most similar example is used to transfer the blood group. With an accuracy above 90%, BOOGIE is a prototype that shows the potential applications of genetic prediction, and it can easily be extended to any Mendelian trait.
To summarize, this thesis is a partial answer to the exponential growth of available sequences that require further experiments. By integrating heterogeneous information and designing new predictive models based on machine learning, I developed new tools for the analysis and classification of biological data. All implementations are freely available to the community and could be useful in future investigations, such as disease studies and drug design.
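The nearest-neighbour label transfer described for BOOGIE can be illustrated with a minimal sketch. This is not the BOOGIE implementation: the Jaccard similarity, the variant identifiers, and the tiny knowledge base are hypothetical stand-ins chosen only to show the idea of matching an unannotated exome against annotated references and copying the label of the closest hit.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of variant identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def predict_blood_group(query_variants, knowledge_base):
    """Return the label of the most similar reference entry (1-nearest neighbour)."""
    best_variants, best_label = max(
        knowledge_base, key=lambda entry: jaccard(query_variants, entry[0])
    )
    return best_label


# Hypothetical knowledge base: (variant set, blood group label).
# The identifiers are placeholders, not real variants.
knowledge_base = [
    (frozenset({"ABO:v1", "ABO:v2"}), "A"),
    (frozenset({"ABO:v3", "ABO:v4"}), "B"),
    (frozenset({"ABO:v5"}), "O"),
]

# Hypothetical unannotated exome sharing most variants with the "A" entry.
query = frozenset({"ABO:v1", "ABO:v2", "ABO:v6"})
predicted = predict_blood_group(query, knowledge_base)
print(predicted)
```

A set-based similarity is used here because variant calls are naturally presence/absence data; the thesis describes its own specialized matching, which may differ.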
41

Nicchia, Elena. "Development of a new diagnostic algorithm for the study of diseases caractherized by high genetic heterogeneity." Doctoral thesis, Università degli studi di Trieste, 2015. http://hdl.handle.net/10077/10854.

Abstract:
2013/2014
Next Generation Sequencing (NGS) technologies, such as the Ion Torrent platform, could simplify the diagnostic process for diseases characterized by a high genetic and phenotypic heterogeneity, because they make it possible to sequence several genes and several patients simultaneously in a single sequencing run. In order to develop a new diagnostic algorithm for rapid molecular diagnosis of these disorders, we have applied the Ion Torrent technology to two different genetically heterogeneous diseases, Fanconi anemia (FA) and inherited thrombocytopenias (IT). Since FA is better characterized than the ITs, we first validated the Ion Torrent technology on 30 samples (2 wild type and 28 FA), 25 of which had already been analyzed with Sanger sequencing. Because of their low sequencing quality, we excluded 2 of the 28 FA samples from this analysis. Then, comparing Ion Torrent and Sanger sequencing data, we evaluated the sensitivity (95%) and the specificity (100%) of the Ion Torrent technology. Moreover, in order to detect copy number variations (CNVs) in FA genes, we refined a statistical analysis based on sequencing coverage data, confirming the presence of large intragenic deletions in FANCA in 5 patients. In summary, we have characterized 25 of the 26 FA patients analyzed, also identifying 4 mutant alleles in the rare complementation groups FANCL and FANCF and 10 mutations in loci different from the genes causing the disease. Since we cannot exclude that new genes are involved in FA, the only patient without any identified mutation is suitable for whole exome analysis. Taking advantage of these good sequencing data, we have developed a diagnostic algorithm that combines the identification of both point mutations and CNVs. In order to verify whether this new diagnostic process could also be applied to other genetically heterogeneous diseases, we analyzed 21 IT patients already characterized by Sanger sequencing.
Among the 2225 variants identified by Ion Torrent technology, using this new approach, we selected those (N=75, 56 different) that were potentially pathogenic because of their frequency (MAF<0.01), their presence in an IT mutation database, or bioinformatics analysis. Thirty of these variants were confirmed by Sanger sequencing, 14 (12 different) of which were located in loci different from the gene causing the disease. It would be interesting to carry out functional studies on these additional variants to unravel the molecular basis of ITs. In summary, we were able to characterize 17 of the 21 IT patients, including 2 patients with deletions in RBM8A (Thrombocytopenia and Absent Radii syndrome, TAR). The remaining 4 mutant alleles were not detected because of low sequencing coverage. In conclusion, according to our data, we can consider the Ion Torrent technology, and in particular the diagnostic algorithm proposed in our study, as feasible approaches for the study of diseases characterized by high genetic and phenotypic heterogeneity. SUMMARY Next Generation Sequencing (NGS) technologies make it possible to analyze several genes and several samples simultaneously. This could reduce the time and cost of analyzing disorders characterized by high genetic and phenotypic heterogeneity, whose characterization is often complex and expensive. To develop a new diagnostic algorithm for the rapid molecular diagnosis of such disorders, we decided to validate one of the most innovative NGS technologies currently on the market, the Ion Torrent method, on two different diseases, both characterized by genetic heterogeneity: Fanconi anemia (FA) and inherited thrombocytopenias (IT).
Since FA is better characterized than the ITs, in the first phase of this thesis work we analyzed 30 samples (25 of which had already been analyzed by Sanger sequencing), 2 wild type and 28 affected. After excluding 2 FA samples from the analysis because of low sequencing quality, we determined the sensitivity (95%) and specificity (100%) of the new method by comparing the Ion Torrent sequencing data with the Sanger data available to us. In addition, using sequence coverage data, we set up a statistical analysis aimed at identifying copy number variations (CNVs), confirming the deletions in the FANCA gene present in 5 patients. We thus characterized 25 of the 26 patients analyzed, also identifying 2 cases with mutations in the rare complementation groups FANCF and FANCL and 10 mutations in loci other than the causative genes. Since we do not exclude the possibility that a new gene is involved in the disease, we consider the only patient still lacking a molecular diagnosis to be a good candidate for exome analysis. Finally, building on these good results, we developed a new diagnostic process for identifying, simply and rapidly, both mutations and CNVs in the 16 genes involved in FA. In the second part of our study, we checked whether this algorithm could be extended to other diseases with high genetic heterogeneity. For this reason we analyzed 21 samples affected by inherited thrombocytopenias that had previously been analyzed by Sanger sequencing.
With the proposed algorithm we were able to select, among the 2225 variants identified, the 75 (56 different) that were potentially pathogenic on the basis of their population frequency (MAF<0.01), their presence in mutation databases, and bioinformatic analysis of pathogenicity. Thirty (27 different) of these variants were confirmed by Sanger sequencing, 14 (12 different) of them in genes other than the causative ones. In light of this, functional studies on these variants are needed to understand the molecular mechanisms underlying inherited thrombocytopenias. Finally, using the proposed algorithm, it was possible to confirm the molecular diagnosis in 17 of the 21 IT patients, including the 2 affected by thrombocytopenia with absent radius (TAR) and carrying a deletion on chromosome 1q21.1. The remaining 4 mutant alleles were not identified because of low sequencing coverage. In conclusion, based on the data collected on samples affected by FA and IT, we can state that Ion Torrent sequencing technology and the diagnostic algorithm we propose are useful tools for obtaining a complete, fast and economical molecular diagnosis.
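The two quantitative steps in this abstract, validating a platform against Sanger calls and shortlisting candidate variants, can be sketched as follows. The counts and variant records are hypothetical (the abstract reports only the resulting percentages), and combining the three selection criteria with a logical OR is an assumption, since the abstract does not specify how they interact.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of Sanger-confirmed variants that the new platform also called."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of Sanger-negative sites that the new platform also left uncalled."""
    return true_neg / (true_neg + false_pos)


def is_candidate(variant, maf_cutoff=0.01):
    """Assumed rule: keep a variant if it is rare, OR listed in a mutation
    database, OR predicted pathogenic by bioinformatic analysis."""
    return (
        variant["maf"] < maf_cutoff
        or variant["in_mutation_db"]
        or variant["predicted_pathogenic"]
    )


# Hypothetical counts chosen only to reproduce the reported percentages.
sens = sensitivity(true_pos=19, false_neg=1)
spec = specificity(true_neg=40, false_pos=0)

# Hypothetical variant records; identifiers are placeholders.
variants = [
    {"id": "var_a", "maf": 0.001, "in_mutation_db": False, "predicted_pathogenic": False},
    {"id": "var_b", "maf": 0.20, "in_mutation_db": False, "predicted_pathogenic": False},
    {"id": "var_c", "maf": 0.05, "in_mutation_db": True, "predicted_pathogenic": False},
]
candidates = [v["id"] for v in variants if is_candidate(v)]

print(f"sensitivity={sens:.2f} specificity={spec:.2f} candidates={candidates}")
```

With these made-up inputs, `var_b` is dropped because it is common and has no supporting annotation, while the other two pass on frequency and database presence respectively.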
XXVII Cycle
1984
42

Emelianova, Katie. "Using next generation sequencing to investigate the generation of diversity in the genus Begonia." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/29584.

Abstract:
Begonia is one of the most diverse genera on the planet, with a species count approaching 2000 and a distribution across the tropics in South America, Africa and South East Asia. The genus has occupied a vast range of niches; many highly variable growth forms can be found across the distribution, and species exhibit very diverse morphologies, even when closely related. A recent study has revealed a putative whole genome duplication (WGD) event in the evolutionary history of Begonia, which has prompted an interest in investigating the impact gene and genome duplication has had on the diversification of Begonia. To answer questions about phenotypic and ecological diversification in Begonia, two species from South America, B. conchifolia and B. plebeja, were chosen as study species based on their close phylogenetic relationship and divergent ecology and phenotype. RNA-seq data for six tissues from B. conchifolia and B. plebeja were generated using the Illumina sequencing platform, and normalised relative expression data were obtained by mapping reads to transcripts predicted from the B. conchifolia draft genome. A bioinformatics pipeline was devised to compare expression profiles across the six tissues between duplicated gene pairs shared between B. conchifolia and B. plebeja. Gene duplicate pairs were selected as candidates if they showed divergent expression in one species but not in the other. Such duplicate pairs are suggestive of neofunctionalization in one species, providing evidence of a potential basis for phenotypic divergence and diversification between B. conchifolia and B. plebeja. Two duplicate pairs were identified as showing such divergent expression patterns as well as being ecologically relevant in function: Chalcone Synthase and 3-Ketoacyl-CoA synthase, involved in anthocyanin biosynthesis and wax biosynthesis respectively.
Investigation of expression and duplication patterns in both gene families showed the candidate gene families to be strikingly different. While 3-Ketoacyl-CoA synthase showed deeper duplications shared with outgroup taxa, Chalcone Synthase appeared to be expanded very recently, with a burst of duplications specific to the genus. 3-Ketoacyl-CoA synthase showed examples of partitioned expression by tissue for different gene family members, with at least five members of the gene family being highly expressed in one or two tissues only. Chalcone Synthase, however, showed dominance of one basal gene family member. Other Chalcone Synthase members, though expressed at lower levels, showed some evidence of reciprocal silencing in B. plebeja, though this pattern was not observed in B. conchifolia. Further investigation of the Chalcone Synthase gene family revealed lineage specific duplication in B. plebeja, and more extensive differential duplication patterns were found across other South American Begonias. Additionally, signals of positive selection were found in two branches on the Chalcone Synthase phylogeny.
43

Khan, Azeem. "Affordable and accesible rolony template preparation for next-generation sequencing." Thesis, Boston University, 2012. https://hdl.handle.net/2144/12443.

Abstract:
Thesis (M.A.)--Boston University
The first draft of the entire human genome was released in 2000, bringing with it the potential for personalized medicine, in which health care would be customized, with practices and decisions tailored to each individual patient by the use of their genetic code. However, the cost and duration of sequencing with the technology available at that time still left genome analysis out of reach for the majority of people. Since then, there has been an ongoing challenge to lower the cost of sequencing and to make it more accessible to the public. Newer methods of genome sequencing using circularized human DNA have now been developed that have the potential both to lower the cost and to speed up the process. One such method is rolony technology, in which the DNA is circularized and amplified, and fluorescent probes are then ligated to the DNA template for sequencing. The order of bases is determined by the fluorescence of the ligated and bound probes. The main hurdle with this technology remains the lack of good quality sequencing templates. A good template allows a rolony to be produced that is efficient in circularization and amplification. It has been proposed that sequence and secondary structure contribute to rolony quality, but the exact parameters have not yet been determined. In the work described here, different rolony templates were chosen and studied for their sequencing potential. The hypotheses tested were whether a sequence-specific secondary structure was required for circularization, whether a sequence-specific secondary structure was required for Rolling Circle Amplification, and whether the secondary structures assisted in folding the DNA into rolonies. It was determined through various experiments that the template sequence and the secondary structure of the template are indicative of the quality of the rolony produced.
44

Alshanbari, Huda Mohammed H. "Additive Cox proportional hazards models for next-generation sequencing data." Thesis, University of Leeds, 2017. http://etheses.whiterose.ac.uk/19739/.

Abstract:
Eighty-nine non-small cell lung cancer (NSCLC) patients experience chromosomal rearrangements called copy number alterations (CNAs), in which the cells have an abnormal number of copies of one or more regions of their genome; these genetic alterations are known to drive cancer development. An important aim of this thesis is to propose a way to combine clinical covariates as fixed predictors with CNA genomic windows as smoothing terms, using the penalized additive Cox proportional hazards (PH) model. Most of the proposed prediction methods assume linearity of the CNA genomic windows along with the clinical covariates. However, continuous covariates can affect the hazard via more complicated nonlinear functional forms. Cox PH models with continuous covariates are therefore likely to be misspecified, because they do not fit the correct functional form for those covariates. Some reports of work on combining clinical covariates with high-dimensional genomic data in clinical genomic prediction are based on the standard Cox PH model. Most of them focus on applying variable selection to high-dimensional CNA genomic data. Our main interest is to propose a variable selection procedure that selects important nonlinear effects from CNA genomic windows. Two different approaches to feature selection are presented: discrete and shrinkage. Discrete feature selection is based on penalized univariate variable selection, which identifies the subset of CNA genomic windows that have the strongest effects on survival time. Feature selection by shrinkage works by adding a second penalty to the penalized partial log-likelihood, which penalizes the smoothing coefficients in the model; as a result, some of the smoothing coefficients are set to zero.
For the NSCLC dataset, we find that tumor size and the spread of cancer into the lymph nodes are significant factors that increase the patients' hazard, and the estimated smooth log hazard ratio curves indicate that some of the significant CNA genomic windows across the genome contribute a higher or lower hazard of death.
45

Graham, Joseph (Joseph Arthur). "An analysis of the next generation DNA sequencing technology market." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/42360.

Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, System Design and Management Program, 2007.
Includes bibliographical references (p. 57-60).
While there is no shortage of successful and failed biotechnology ventures, it is still very difficult to gauge, a priori, how a new company will fare in this industry. In many cases new biotechnology ventures are driven by rapidly evolving technology and emergent customer needs, both unpredictable by nature. Also, the biotech industry faces increased public and federal scrutiny as companies attempt to navigate murky ethical and legal waters. This thesis explores the ongoing development of the next generation DNA sequencing market in an effort to predict which factors will play a role in determining who will ultimately succeed. This is accomplished through an analysis that combines historical precedents in this industry with traditional market theories. The goal is to produce a set of dimensions along which to judge the current and future participants in this market in order to determine which are most likely to succeed.
by Joseph Graham.
S.M.
46

Mayo, Thomas Richard. "Machine learning for epigenetics : algorithms for next generation sequencing data." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/33055.

Abstract:
The advent of Next Generation Sequencing (NGS), a little over a decade ago, has led to a vast and rapid increase in the generation of genomic data. The drastically reduced cost has in turn enabled powerful modifications that can be used to investigate not just genetic, but epigenetic, phenomena. Epigenetics refers to the study of mechanisms affecting gene expression other than the genetic code itself and thus, at the transcription level, incorporates DNA methylation, transcription factor binding and histone modifications amongst others. This thesis outlines and tackles two major challenges in the computational analysis of such data using techniques from machine learning. Firstly, I address the problem of testing for differential methylation between groups of bisulfite sequencing data sets. DNA methylation plays an important role in genomic imprinting, X-chromosome inactivation and the repression of repetitive elements, as well as being implicated in numerous diseases, such as cancer. Bisulfite sequencing provides single nucleotide resolution methylation data at the whole genome scale, but a sensitive analysis of such data is difficult. I propose a solution that uses a powerful kernel-based machine learning technique, the Maximum Mean Discrepancy, to leverage well-characterised spatial correlations in DNA methylation, and adapt the method for this particular use. I use this tailored method to analyse a novel data set from a study of ageing in three different tissues in the mouse. This study motivates further modifications to the method and highlights the utility of the underlying measure as an exploratory tool for methylation analysis. Secondly, I address the problem of predictive and explanatory modelling of chromatin immunoprecipitation sequencing data (ChIP-Seq). ChIP-Seq is typically used to assay the binding of a protein of interest, such as a transcription factor or histone, to the DNA, and as such is one of the most widely used sequencing assays.
While peak callers are a powerful tool for identifying binding sites in sparse and clean ChIP-Seq profiles, broader signals defy analysis in this framework. Instead, generative models that explain the data in terms of the underlying sequence can help uncover mechanisms that predict binding or the lack thereof. I explore current problems with ChIP-Seq analysis, such as zero-inflation and the use of the control experiment, known as the input. I then devise a method for representing k-mers that enables the use of longer DNA sub-sequences within a flexible model development framework, such as generalised linear models, without heavy programming requirements. Finally, I use these insights to develop an appropriate Bayesian generative model that predicts ChIP-Seq count data in terms of the underlying DNA sequence, incorporating DNA methylation information where available, fitting the model with the Expectation-Maximization algorithm. The model is tested on simulated data and real data pertaining to the histone mark H3K27me3. This thesis therefore straddles the fields of bioinformatics and machine learning. Bioinformatics is both plagued and blessed by the plethora of different techniques available for gathering data and their continual innovations. Each technique presents a unique challenge, and hence out-of-the-box machine learning techniques have had little success in solving biological problems. While I have focused on NGS data, the methods developed in this thesis are likely to be applicable to future technologies, such as Third Generation Sequencing methods, and the lessons learned in their adaptation will be informative for the next wave of computational challenges.
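For readers unfamiliar with the Maximum Mean Discrepancy (MMD) mentioned in this abstract, the following is a minimal sketch of the standard biased estimator of squared MMD with a Gaussian kernel on one-dimensional data. It is the textbook statistic only, not the tailored method developed in the thesis; the methylation values, group names and bandwidth are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two scalar values."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=0.2):
    """Biased (V-statistic) estimate of squared MMD between two samples:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""

    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Hypothetical methylation levels (fraction methylated) at one locus.
group_young = [0.10, 0.15, 0.12, 0.20]
group_old = [0.60, 0.70, 0.65, 0.75]

same = mmd_squared(group_young, group_young)
different = mmd_squared(group_young, group_old)
print(f"same-group MMD^2 = {same:.3f}, between-group MMD^2 = {different:.3f}")
```

The statistic is zero when a sample is compared with itself and grows as the two distributions separate, which is what makes it usable as a test statistic or exploratory measure.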
47

Chen, Xi. "Bayesian Integration and Modeling for Next-generation Sequencing Data Analysis." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/71706.

Abstract:
Computational biology currently faces challenges in a big data world with thousands of data samples across multiple disease types including cancer. The challenging problem is how to extract biologically meaningful information from large-scale genomic data. Next-generation Sequencing (NGS) can now produce high quality data at the DNA and RNA levels. However, in cells there exist many non-specific (background) signals that affect the detection accuracy of true (foreground) signals. In this dissertation work, under a Bayesian framework, we aim to develop and apply approaches to learn the distribution of genomic signals in each type of NGS data for reliable identification of specific foreground signals. We propose a novel Bayesian approach (ChIP-BIT) to reliably detect transcription factor (TF) binding sites (TFBSs) within promoter or enhancer regions by jointly analyzing the sample and input ChIP-seq data for one specific TF. Specifically, a Gaussian mixture model is used to capture both binding and background signals in the sample data, and background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. An Expectation-Maximization algorithm is used to learn the model parameters according to the distributions of binding signal intensity and binding locations. Extensive simulation studies and experimental validation both demonstrate that ChIP-BIT has significantly improved performance on TFBS detection over conventional methods, particularly on weak binding signal detection. To infer cis-regulatory modules (CRMs) of multiple TFs, we propose a Bayesian integration approach, namely BICORN, to integrate ChIP-seq and RNA-seq data from the same tissue. Each TFBS identified from ChIP-seq data can be either a functional binding event mediating target gene transcription or a non-functional binding.
The functional bindings of a set of TFs usually work together as a CRM to regulate the transcription processes of a group of genes. We develop a Gibbs sampling approach to learn the distribution of CRMs (a joint distribution of multiple TFs) based on their functional bindings and target gene expression. The robustness of BICORN has been validated on simulated regulatory network and gene expression data under different noise settings. BICORN is further applied to breast cancer MCF-7 ChIP-seq and RNA-seq data to identify CRMs functional in promoter or enhancer regions. In tumor cells, the normal regulatory mechanism may be interrupted by genome mutations, especially somatic mutations that occur uniquely in tumor cells. Focusing on a specific type of genome mutation, structural variation (SV), we develop a novel pattern-based probabilistic approach, namely PSSV, to identify somatic SVs from whole genome sequencing (WGS) data. PSSV features a mixture model with hidden states representing different mutation patterns; PSSV can thus differentiate heterozygous and homozygous SVs in each sample, enabling the identification of those somatic SVs with a heterozygous status in the normal sample and a homozygous status in the tumor sample. Simulation studies demonstrate that PSSV outperforms existing tools. PSSV has been successfully applied to breast cancer patient WGS data for identifying somatic SVs of key factors associated with breast cancer development. In this dissertation research, we demonstrate the advantage of the proposed distributional learning-based approaches over conventional methods for NGS data analysis. Distributional learning is a very powerful approach to gain biological insights from high-quality NGS data. 
Successful applications of the proposed Bayesian methods to breast cancer NGS data shed light on underlying molecular mechanisms of breast cancer, enabling biologists or clinicians to identify major cancer drivers and develop new therapeutics for cancer treatment.
Ph. D.
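
The mixture-model idea summarized in this abstract can be illustrated with a minimal sketch. It is not the ChIP-BIT implementation: it assumes one-dimensional signal intensities, a single global background Gaussian estimated from control ("input") values and then held fixed, and a plain EM loop that learns only the foreground weight, mean and variance. The function names (normal_pdf, fit_foreground) and the simulated data are hypothetical and chosen for illustration only.

```python
import math
import random
from statistics import mean, pvariance


def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_foreground(sample, control, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture with a fixed background.

    Assumption for this sketch: the background component is estimated once
    from the control values and is not updated during EM.
    """
    bg_mu, bg_var = mean(control), pvariance(control)
    # Crude initialisation: foreground starts at the top of the sample range.
    fg_mu, fg_var, weight = max(sample), pvariance(sample), 0.5
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each value is foreground.
        resp = []
        for x in sample:
            fg = weight * normal_pdf(x, fg_mu, fg_var)
            bg = (1.0 - weight) * normal_pdf(x, bg_mu, bg_var)
            denom = fg + bg
            resp.append(fg / denom if denom > 0.0 else 0.0)
        # M-step: update the mixing weight and the foreground mean/variance.
        total = max(sum(resp), 1e-12)
        weight = total / len(sample)
        fg_mu = sum(r * x for r, x in zip(resp, sample)) / total
        fg_var = sum(r * (x - fg_mu) ** 2 for r, x in zip(resp, sample)) / total
        fg_var = max(fg_var, 1e-6)  # keep the variance away from zero
    return weight, fg_mu, fg_var, resp


# Simulated data (hypothetical): background ~ N(0, 1), foreground ~ N(4, 1).
random.seed(0)
control = [random.gauss(0.0, 1.0) for _ in range(500)]
sample = [random.gauss(0.0, 1.0) for _ in range(400)]
sample += [random.gauss(4.0, 1.0) for _ in range(100)]

weight, fg_mu, fg_var, resp = fit_foreground(sample, control)
print(f"foreground weight={weight:.2f}, mean={fg_mu:.2f}, variance={fg_var:.2f}")
```

Each entry of resp is the posterior probability that the corresponding sample value belongs to the foreground component, so thresholding it (for example at 0.5) gives a simple foreground call under the assumptions above.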
48

Thrush, Mariah A. "Analyzing Algal Diversity in Aquatic Systems Using Next Generation Sequencing." Ohio University Honors Tutorial College / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ouhonors1366807717.

49

Camerlengo, Terry Luke. "Techniques for Storing and Processing Next-Generation DNA Sequencing Data." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1388502159.

50

Porter, Ashleigh Fay. "Next generation sequencing to explore microbial diversity, origins and evolution." Thesis, The University of Sydney, 2021. https://hdl.handle.net/2123/24919.

Abstract:
Emerging infectious diseases are major contributors to morbidity and mortality. To prevent such diseases from occurring in the future it is important to understand pathogen evolution and emergence. Unfortunately, we know little about the virosphere outside of clinically significant viruses, leaving the bulk of viral diversity unexplored. A key aim of my thesis was to reveal more of the unknown diversity of viruses, particularly in under-studied animal hosts. To this end I employed bulk RNA sequencing (“meta-transcriptomics”) to identify novel viruses in a range of hosts, including native Australian wildlife and invertebrate species, as well as from mining databases of short-read archives. The novel viruses discovered were associated with an array of viral families, including the Flaviviridae, Parvoviridae, Circoviridae, Nudiviridae, Polyomaviridae, and Herpesviridae, in turn expanding our knowledge of the diversity and evolutionary history of these families. I similarly used meta-transcriptomics to document the presence of viral, bacterial and eukaryotic parasite sequences in commonly used laboratory reagents. Additionally, I explored the evolutionary history of two important members of the family Poxviridae: variola virus and the vaccinia virus. Accordingly, I described the historical context of smallpox, with an emphasis on the initial outbreak in Australia, and used ancient DNA techniques to reveal the origins and evolutionary history of the poxviruses used in early vaccination campaigns. Broadly, this thesis has expanded our knowledge of the diversity of viruses and revealed the evolutionary history of viral families that have a major impact on human and animal health. By increasing our knowledge of viral diversity, my work provides important new insights into their ecology and evolution, particularly the transmission of viruses to new host species that underpins disease emergence.