Dissertations / Theses: 'High throughput sequencing (NGS)'

1

Kawalia, Amit [Verfasser], Peter [Gutachter] Nürnberg, and Michael [Gutachter] Nothnagel. "Addressing NGS Data Challenges: Efficient High Throughput Processing and Sequencing Error Detection / Amit Kawalia ; Gutachter: Peter Nürnberg, Michael Nothnagel." Köln : Universitäts- und Stadtbibliothek Köln, 2016. http://d-nb.info/112370368X/34.

2

Bisseux, Maxime. "Dynamique de la circulation des Entérovirus de l'homme à l'environnement : Etude par séquençage haut débit." Thesis, Université Clermont Auvergne (2017-2020), 2017. http://www.theses.fr/2017CLFAS013.

Abstract:

Enteroviruses (EVs) are picornaviruses (non-enveloped, positive-sense RNA viruses) characterized by great genetic and antigenic diversity (116 types classified into 4 taxonomic species, EV-A to EV-D) and rapid evolution. Human infections are very frequent, highly contagious through stool, and epidemic. Most infections are asymptomatic or benign, but they can be severe or even fatal, particularly in young children. Poliomyelitis is the model EV infection. Combined with clinical and virological surveillance, mass vaccination has brought the WHO's Global Polio Eradication Initiative closer than ever to its goal. However, the detection of wild-type polioviruses in countries declared polio-free for several years and the recent emergence, in several regions of the world, of non-polio enteroviruses (EV-A71, EV-D68) associated with severe clinical manifestations underscore the importance of monitoring EV circulation in the human population. The aim of the thesis was to detect and characterize EVs in the wastewater of the Clermont-Ferrand urban area (France) and to compare the data with those from clinical surveillance, in order to obtain a more complete picture of viral circulation in the general population. A method was developed to concentrate viruses from wastewater sampled at the inlet (raw wastewater) and outlet (treated wastewater) of the sewage treatment plant, allowing the molecular detection of EVs and 6 other human enteric viruses. Viral genomes were detected in all samples from October 2014 to October 2015, with a median of 6 different viruses in raw wastewater and 4 in treated wastewater. Phylogenetic analysis of the EV and hepatitis A and E virus sequences found in wastewater and in clinical samples from patients hospitalized at the Clermont-Ferrand University Hospital during the same period validated the approach for monitoring the community circulation of an enteric virus.
The diversity of EVs in raw wastewater was analyzed by amplicon sequencing with Illumina high-throughput technology (metabarcoding). The analysis revealed a large diversity of EVs and the silent circulation of 25 types not detected in hospital data (in particular 9 EV-C types, including sequences of vaccine poliovirus 1). Phylogenetic analysis of intra-typic variants showed several distinct epidemic patterns among the predominant EV types circulating over the study period. The data demonstrate the feasibility and sensitivity of the strategy developed to detect and characterize EVs in wastewater. They also inform the discussion of the role of environmental surveillance in monitoring non-polio EV infections (epidemiological studies, epidemic prevention, health alerts). Jointly monitoring enteric viruses in the environment and in patients gives a better understanding of their prevalence. This global approach to virus circulation and health ecology represents a significant commitment for laboratories and will require integration into structured national and international collaboration networks that extend beyond EV surveillance alone.

3

Nemoz, Benjamin. "Exploration longitudinale à haut débit et en cellule unique du répertoire d'anticorps neutralisants à large spectre chez un neutraliseur d'élite du VIH-1." Electronic Thesis or Diss., Université Grenoble Alpes, 2024. http://www.theses.fr/2024GRALV012.

Abstract:

Human Immunodeficiency Virus type 1 (HIV-1) infection remains a major global health concern, with an estimated 37.7 million people living with the virus worldwide and more than a million new infections each year. Effective antiretroviral therapies are now available and allow sustained treatment of infected individuals. These therapies have also improved prevention and helped curb the epidemic, notably in high-income countries. However, a vaccine is still needed to control the epidemic, especially in lower-income regions and precarious settings.
The protective role of neutralizing antibodies (NAbs) has been unequivocally demonstrated both in animal models of HIV infection and in humans. Consequently, a B-cell-based vaccine capable of eliciting antibodies (Abs) that neutralize the majority of circulating viruses, namely broadly neutralizing antibodies (bNAbs), could be envisioned as an answer to the HIV pandemic.
The investigation of bNAb development in HIV-1 elite neutralizers provides valuable insights for the design of such vaccines. To date, most studies have relied on conventional single B-cell sorting by flow cytometry (FACS) to isolate bNAbs. In the present study, we used the Chromium Single Cell Immune Profiling approach (scRNA-seq) to conduct a high-throughput, longitudinal, single-cell exploration of the B-cell repertoire in an HIV-1 elite neutralizer. Importantly, this method allows a much greater number of HIV envelope glycoprotein (Env) baits to be used for identifying specific B cells than regular FACS-based Ab isolation studies, providing a more comprehensive view of the anti-Env Ab repertoire. In addition, the approach yields a wealth of information on the nature of the specific Abs identified and on the corresponding B cells.
The study uncovered the sequences of 12,130 putative HIV Env-specific Abs. Antibodies from 39 lineages were produced and tested for neutralization, revealing 21 distinct neutralizing lineages. The results thus demonstrate the ability of the method to explore large antigen-specific Ab repertoires from longitudinal samples. The neutralizing activity of Abs from four neutralizing lineages together recapitulated the serum activity of the donor, neutralizing 62.4% of a large predictive panel of 126 pseudoviruses. One of these lineages was shown to target the gp120 high-mannose patch supersite with great breadth and potency; Abs from this lineage were sensitive to the presence of a glycan at position N332. A single Ab from this lineage accounted for most of the neutralization breadth (51.1%) with high potency (mean IC50 of 91.1 ng/mL). This Ab has a 23-amino-acid CDRH3 and 20% somatic hypermutation (SHM). The lineage showed continuous maturation over 6.5 years, with observed SHM rates ranging from 2.0% to 30.6% for the heavy chain, without any insertions or deletions.
Conventional FACS-based sorting had previously been used to isolate bNAbs from the same donor. In comparison, the high-throughput single-cell approach isolated orders of magnitude more Abs. Furthermore, the newly isolated NAbs were overall more potent and broader than those isolated previously, indicating the superiority of the new method for recovering neutralizing lineages. Ongoing structural studies will elucidate the epitopes responsible for the broad neutralization observed in this donor. Together, the findings may contribute to the design of reverse vaccinology approaches, which currently show promise for the development of an effective HIV vaccine.

4

Horton, Dean J. "Using molecular techniques to investigate soil invertebrate communities in temperate forests." Kent State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=kent1448799316.

5

Roguski, Łukasz 1987. "High-throughput sequencing data compression." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/565775.

Abstract:

Thanks to advances in sequencing technologies, biomedical research has experienced a revolution over recent years, resulting in an explosion in the amount of genomic data being generated worldwide. The typical space requirement for storing sequencing data produced by a medium-scale experiment lies in the range of tens to hundreds of gigabytes, with multiple files in different formats being produced by each experiment. The current de facto standard file formats used to represent genomic data are text-based. For practical reasons, these are stored in compressed form. In most cases, such storage methods rely on general-purpose text compressors, such as gzip. However, these methods are unable to exploit the information models specific to sequencing data, and as a result they usually provide limited functionality and insufficient savings in storage space. This explains why relatively basic operations such as processing, storage, and transfer of genomic data have become a typical bottleneck of current analysis setups. Therefore, this thesis focuses on methods to efficiently store and compress the data generated from sequencing experiments. First, we propose a novel general-purpose FASTQ file compressor. Compared to gzip, it achieves a significant reduction in the size of the resulting archive, while also offering high data processing speed. Next, we present compression methods that exploit the high sequence redundancy present in sequencing data. These methods achieve the best compression ratio among current state-of-the-art FASTQ compressors, without using any external reference sequence. We also demonstrate different lossy compression approaches to store auxiliary sequencing data, which allow for further reductions in size. Finally, we propose a flexible framework and data format, which allows one to semi-automatically generate compression solutions that are not tied to any specific genomic file format.
To facilitate the data management needed by complex pipelines, multiple genomic datasets with heterogeneous formats can be stored together in configurable containers, with an option to perform custom queries over the stored data. Moreover, we show that simple solutions based on our framework can achieve results comparable to those of state-of-the-art format-specific compressors. Overall, the solutions developed and described in this thesis can easily be incorporated into current pipelines for the analysis of genomic data. Taken together, they provide grounds for the development of integrated approaches towards efficient storage and management of such data.
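
The abstract's point that general-purpose compressors cannot exploit the structure of sequencing data can be illustrated with a small sketch. It is not the compressor proposed in the thesis. It only shows one commonly used idea, under stated assumptions: read identifiers, bases, and quality strings are separated into their own streams, and bases are packed into 2 bits each before a general-purpose compressor is applied. All function names are hypothetical, and the sketch assumes reads contain only A, C, G, and T (real data also contains N and would need an escape mechanism).

```python
import gzip

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def split_streams(fastq_text):
    """Split 4-line FASTQ records into id, sequence, and quality streams."""
    lines = fastq_text.strip().split("\n")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        ids.append(lines[i][1:])    # drop the leading '@'
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])  # lines[i + 2] is the '+' separator
    return ids, seqs, quals


def pack_bases(seq):
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(packed, length):
    """Inverse of pack_bases; `length` trims the padding of the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 3])
    return "".join(bases[:length])


def compress_streams(fastq_text):
    """Compress each stream separately with gzip.

    Read lengths are not stored explicitly: each quality string has the
    same length as its read, so lengths can be recovered from that stream.
    """
    ids, seqs, quals = split_streams(fastq_text)
    return {
        "ids": gzip.compress("\n".join(ids).encode("ascii")),
        "bases": gzip.compress(b"".join(pack_bases(s) for s in seqs)),
        "quals": gzip.compress("\n".join(quals).encode("ascii")),
    }
```

Separating the streams lets each one be modelled on its own terms, which is the kind of format-specific knowledge a plain text compressor has no access to. How much space this saves depends on the data and is not claimed here.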

6

Mozere, M. "High-throughput sequencing analysis pipeline." Thesis, University College London (University of London), 2016. http://discovery.ucl.ac.uk/1528797/.

Abstract:

High-throughput sequencing methods were developed to increase the productivity of processing data from genomic DNA. Sequencing platforms generate massive amounts of genetic variation data, which makes it difficult to pinpoint the small subset of functionally important variants. The focus has now shifted from generating sequences to searching for the critical differences that separate normal variants from disease-causing ones. Our High-throughput Sequencing Analysis Pipeline (HSAP) is a multistep analysis tool designed to annotate and filter variants from Variant Call Format (VCF) files in a top-down fashion, in order to find disease-causing variants in patients. It is designed for the Linux environment and is composed of a collection of interacting task-specific modules written in different programming languages (such as Python and C++) and shell scripts. Each module performs a specific task, for example annotating variants with their functional characterisation, zygosity status, and allele frequencies within the population, or filtering variants by inherited disease model, read depth, call quality, physical location, and other criteria. The output is written to a file in the universal VCF format, which contains the annotated and filtered genomic variants. The pipeline was verified by identifying and confirming a specific disease-causing mutation for a single-gene disorder. HSAP is designed as open-source, locally self-contained, bootable software that uses only information from publicly available databases. It has a user-friendly offline web interface that allows users to select different modules and chain them together to create custom filtering arrangements, adapting the pipeline as needed.
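
One of the filtering steps the abstract mentions (read depth and call quality) can be sketched as follows. This is not HSAP code. It is a minimal illustration that assumes tab-separated VCF data lines, with the call quality in the `QUAL` column and the read depth in the `DP` key of the `INFO` column. The function names and both thresholds are hypothetical.

```python
def parse_info(info_field):
    """Parse a VCF INFO column such as 'DP=35;AF=0.5;DB' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        elif item and item != ".":
            info[item] = True  # flag entries carry no value
    return info


def filter_vcf_lines(lines, min_qual=30.0, min_depth=10):
    """Keep header lines and variants that pass quality and depth cut-offs.

    Columns follow the VCF layout:
    CHROM POS ID REF ALT QUAL FILTER INFO [...]
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header and meta lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        info = parse_info(fields[7])
        if qual == "." or float(qual) < min_qual:
            continue  # missing or low call quality
        if int(info.get("DP", 0)) < min_depth:
            continue  # missing or low read depth
        kept.append(line)
    return kept
```

Because the function returns VCF lines unchanged, several such filters could be chained one after another, which mirrors the modular arrangement described in the abstract.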

7

Durif, Ghislain. "Multivariate analysis of high-throughput sequencing data." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1334/document.

Abstract:

The statistical analysis of Next-Generation Sequencing (NGS) data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Developments are made concerning the sparse Partial Least Squares (PLS) regression framework for supervised classification and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose is the reconstruction and visualization of the data. First, we present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. Such a method is used, for instance, for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in such a framework is to account for the response in order to discard irrelevant variables. We highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The value of our procedure for data reconstruction, visualization, and clustering is illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods are implemented in two R packages, "plsgenomics" and "CMF", based on high-performance computing.
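
As a rough illustration of what zero-inflation means for count data, the sketch below evaluates a zero-inflated Poisson probability mass function. It is a deliberately simplified stand-in and not the thesis's model: the abstract describes a factorization model that also handles over-dispersion and is fitted by variational inference, none of which is reproduced here. The function name and the parameters `lam` and `pi_zero` are hypothetical.

```python
import math


def zip_pmf(k, lam, pi_zero):
    """P(X = k) under a zero-inflated Poisson distribution.

    With probability pi_zero the count is a structural zero; otherwise it
    is drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # zeros come from both the structural-zero and the Poisson component
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson
```

Setting `pi_zero` to 0 recovers the ordinary Poisson distribution, which makes the extra mass at zero easy to see by comparison.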

8

Langenberger, David. "High-throughput sequencing and small non-coding RNAs." Doctoral thesis, Universitätsbibliothek Leipzig, 2013. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-112876.

Abstract:

In this thesis, the processing mechanisms of short non-coding RNAs (ncRNAs) are investigated using data generated by high-throughput sequencing (HTS). The recently adapted short RNA-seq protocol allows the sequencing of RNA fragments of microRNA-like length (∼18-28 nt). Thus, after mapping the data back to a reference genome, it is possible not only to measure but also to visualize the expression of all ncRNAs that are processed to fragments of this specific length. Short RNA-seq data was used to show that a highly abundant class of small RNAs called microRNA-offset-RNAs (moRNAs), formerly detected in a basal chordate, is also produced from human microRNA precursors. To simplify the search, the blockbuster tool, which automatically recognizes blocks of reads to detect specific expression patterns, was developed. Using blockbuster, blocks from moRNAs were detected directly next to the miR or miR* blocks and could thus easily be registered in an automated way. Further investigation of the short RNA-seq data showed that not only microRNAs give rise to short, ∼22 nt long RNA pieces, but also almost all other classes of ncRNAs, such as tRNAs, snoRNAs, snRNAs, rRNAs, Y-RNAs, and vault RNAs. The read patterns that arise after mapping these RNAs back to a reference genome seem to reflect the processing of each class and are thus specific to the RNA transcripts from which they are derived. The potential of these patterns for the classification and identification of non-coding RNAs was explored. Using a random forest classifier trained on a set of characteristic features of the individual ncRNA classes, it was possible to distinguish three types of ncRNAs, namely microRNAs, tRNAs, and snoRNAs. To make the classification available to the research community, the free web service 'DARIO', which allows users to study short read data from small RNA-seq experiments, was developed.
The classification has shown that read patterns are specific to different classes of ncRNAs. To make use of this feature, the tool deepBlockAlign was developed. deepBlockAlign introduces a two-step approach to align read patterns with the aim of quickly identifying RNAs that share similar processing footprints. In order to find possible exceptions to the well-known microRNA maturation by Dicer and to identify additional substrates for Dicer processing, the small RNA sequencing data of a Dicer knockdown experiment in MCF-7 cells was re-evaluated. Several Dicer-independent microRNAs were found, among them the important tumor suppressor mir-663a. It is known that many aspects of RNA maturation leave traces in RNA sequencing data in the form of mismatches from the reference genome. It is possible to recover many well-known modified sites in tRNAs, providing evidence that modified nucleotides are a pervasive phenomenon in these data sets.
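
The idea of recognizing blocks of mapped reads can be sketched with a simple interval-merging routine. This is not the blockbuster algorithm, whose details the abstract does not give. It is a simplified stand-in that assumes reads on one strand of one chromosome are given as `(start, end)` coordinates, and it groups reads that overlap or lie within `max_gap` bases of the current block. The function name and the `max_gap` parameter are hypothetical.

```python
def group_reads_into_blocks(reads, max_gap=0):
    """Group mapped reads into blocks of overlapping or nearby intervals.

    `reads` is an iterable of (start, end) pairs with start <= end.
    Returns a list of dicts with the block boundaries and read count.
    """
    blocks = []
    for start, end in sorted(reads):
        if blocks and start <= blocks[-1]["end"] + max_gap:
            # read touches the current block: extend it and count the read
            blocks[-1]["end"] = max(blocks[-1]["end"], end)
            blocks[-1]["count"] += 1
        else:
            blocks.append({"start": start, "end": end, "count": 1})
    return blocks
```

In a microRNA-like locus, adjacent blocks found this way would correspond to neighbouring processing products, such as a miR block and a block directly next to it.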

9

Zhang, Xuekui. "Mixture models for analysing high throughput sequencing data." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/35982.

Abstract:

The goal of my thesis is to develop methods and software for analysing high-throughput sequencing data, with an emphasis on sonicated ChIP-seq. To this end, we developed several variants of mixture models for genome-wide profiling of transcription factor binding sites and nucleosome positions. Our methods have been implemented in Bioconductor packages, which are freely available to other researchers. For profiling transcription factor binding sites, we developed a method, PICS, and implemented it in a Bioconductor package. We used a simulation study to confirm that PICS compares favourably to rival methods, such as MACS, QuEST, CisGenome, and USeq. Using published GABP and FOXA1 data from human cell lines, we then show that PICS-predicted binding sites were more consistent with computationally predicted binding motifs than those of the alternative methods. For motif discovery using transcription factor binding sites, we combined PICS with two other existing packages to create the first complete set of Bioconductor tools for peak-calling and binding motif analysis of ChIP-Seq and ChIP-chip data. We demonstrate the effectiveness of our pipeline on published human ChIP-Seq datasets for FOXA1, ER, CTCF and STAT1, detecting co-occurring motifs that were consistent with the literature but not detected by other methods. For nucleosome positioning, we modified PICS into a method called PING. PING can handle MNase-Seq and MNase- or sonicated-ChIP-Seq data. It compares favourably to NPS and TemplateFilter in scalability, accuracy and robustness to low read density. To demonstrate that PING predictions from sonicated data can have sufficient spatial resolution to be biologically meaningful, we use H3K4me1 data to detect nucleosome shifts, discriminate functional and non-functional transcription factor binding sites, and confirm that Foxa2 associates with the accessible major groove of nucleosomal DNA. All of the above use single-end sequencing data.
At the end of the thesis, we briefly discuss the issue of processing paired-end data, which we are currently investigating.

APA, Harvard, Vancouver, ISO, and other styles

10

Roberts, Adam. "Ambiguous fragment assignment for high-throughput sequencing experiments." Thesis, University of California, Berkeley, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3616509.

Full text

Abstract:

As the cost of short-read, high-throughput DNA sequencing continues to fall rapidly, new uses for the technology have been developed aside from its original purpose in determining the genome of various species. Many of these new experiments use the sequencer as a digital counter for measuring biological activities such as gene expression (RNA-Seq) or protein binding (ChIP-Seq).

A common problem faced in the analysis of these data is that of sequenced fragments that are "ambiguous", meaning they resemble multiple loci in a reference genome or other sequence. In early analyses, such ambiguous fragments were ignored or were assigned to loci using simple heuristics. However, statistical approaches using maximum likelihood estimation have been shown to greatly improve the accuracy of downstream analyses and have become widely adopted. Optimization procedures based on the expectation-maximization (EM) algorithm are often employed by these methods to find the optimal sets of alignments, with frequent enhancements to the model. Nevertheless, these improvements increase complexity, which, along with an exponential growth in the size of sequencing datasets, has led to new computational challenges.
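As an illustration only, the general idea of EM-based assignment of ambiguous fragments can be sketched as follows. This is a deliberately minimal toy, not the model of the thesis: it assumes each fragment is simply listed with the target indices it aligns to, and it ignores target length, bias, and error parameters. The function name `em_fragment_assignment` and its arguments are hypothetical.

```python
def em_fragment_assignment(compat, n_targets, n_iter=200, tol=1e-12):
    """Estimate relative target abundances from ambiguous alignments.

    compat[i] is the non-empty list of target indices that fragment i
    aligns to (a hypothetical input format chosen for this sketch).
    """
    if any(len(targets) == 0 for targets in compat):
        raise ValueError("every fragment needs at least one compatible target")

    # Start from a uniform abundance estimate.
    theta = [1.0 / n_targets] * n_targets
    for _ in range(n_iter):
        counts = [0.0] * n_targets
        # E-step: split each fragment across its compatible targets
        # in proportion to the current abundance estimates.
        for targets in compat:
            total = sum(theta[t] for t in targets)
            for t in targets:
                counts[t] += theta[t] / total
        # M-step: abundances are the normalized expected counts.
        new_theta = [c / len(compat) for c in counts]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


# Two fragments unique to target 0, one unique to target 1, one ambiguous.
fragments = [[0], [0], [0, 1], [1]]
abundances = em_fragment_assignment(fragments, n_targets=2)
```

In this toy input the ambiguous fragment ends up shared in proportion to the evidence from the uniquely aligned fragments, which is the behaviour that simple "ignore" or "assign at random" heuristics lack.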

Herein, we present our model for ambiguous fragment assignment for RNA-Seq, which includes the most comprehensive set of parameters of any model introduced to date, as well as various methods we have explored for scaling our optimization procedure. These methods include the use of an online EM algorithm and a distributed EM solution implemented on the Spark cluster computing system. Our advances have resulted in the first efficient solution to the problem of fragment assignment in sequencing.

Furthermore, we are the first to create a fully generalized model for ambiguous fragment assignment and present details on how our method can provide solutions for additional high-throughput sequencing assays including ChIP-Seq, Allele-Specific Expression (ASE), and the detection of RNA-DNA Differences (RDDs) in RNA-Seq.

APA, Harvard, Vancouver, ISO, and other styles

11

Hoffmann, Steve. "Genome Informatics for High-Throughput Sequencing Data Analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-152643.

Full text

Abstract:

This thesis introduces three different algorithmic and statistical strategies for the analysis of high-throughput sequencing data. First, we introduce a heuristic method based on enhanced suffix arrays to map short sequences to larger reference genomes. The algorithm builds on the idea of an error-tolerant traversal of the suffix array for the reference genome in conjunction with the concept of matching statistics introduced by Chang and a bitvector-based alignment algorithm proposed by Myers. The algorithm supports paired-end and mate-pair alignments, and the implementation offers methods for primer detection and for primer and poly-A trimming. In our own benchmarks as well as independent benchmarks, this tool outcompetes other currently available tools with respect to sensitivity and specificity in simulated and real data sets for a large number of sequencing protocols. Second, we introduce a novel dynamic programming algorithm for the spliced alignment problem. The advantage of this algorithm is its capability to detect not only collinear splice events, i.e. local splice events on the same genomic strand, but also circular and other non-collinear splice events. This succinct and simple algorithm handles all these cases at the same time with high accuracy. While it is on par with other state-of-the-art methods for collinear splice events, it outcompetes other tools for many non-collinear splice events. The application of this method to publicly available sequencing data led to the identification of a novel isoform of the tumor suppressor gene p53. Since this gene is one of the best studied genes in the human genome, this finding is quite remarkable and suggests that the application of our algorithm could help to identify a plethora of novel isoforms and genes. Third, we present a data-adaptive method to call single nucleotide variations (SNVs) from aligned high-throughput sequencing reads.
We demonstrate that our method, based on empirical log-likelihoods, automatically adjusts to the quality of a sequencing experiment and thus renders a "decision" on when to call an SNV. In our simulations this method is on par with current state-of-the-art tools. Finally, we present biological results that have been obtained using the special features of the presented alignment algorithm.

APA, Harvard, Vancouver, ISO, and other styles

12

Duggett, Nicholas A. "High-throughput sequencing of the chicken gut microbiome." Thesis, University of Birmingham, 2016. http://etheses.bham.ac.uk//id/eprint/6678/.

Full text

Abstract:

The chicken (Gallus gallus domesticus) is the most abundant and widely distributed livestock animal with a global population of over 21 billion. A newly hatched broiler chick increases its body weight by 25% overnight and 50-fold over five weeks. The symbiotic, complex and variable community of the microbiome forms an important part of the gastrointestinal tract (gut). It is involved in gut development, biochemistry, immunology, physiology and non-specific resistance to infection. This study investigated the chicken gut microbiota using high-throughput 16S rRNA sequencing and culture-based techniques. There was specific interest in the proventriculus, on which there is currently limited research in the literature, and the caecum, because it contains the highest density of bacterial cells in the gut at \(10^{11}\) per gram. The results showed no significant difference in the first stages of the gut, which shared a low-diversity microbiota dominated by a few Lactobacillus species. The microbiota becomes more diverse in the latter parts of the small intestine, where Clostridiales and Enterobacteriaceae were present in higher numbers. The caecum was the most diverse organ, with the majority of species belonging to Ruminococcaceae, Lachnospiraceae and Alistipes. A number of novel species were isolated from the chicken gut and six of these were whole-genome sequenced.

APA, Harvard, Vancouver, ISO, and other styles

13

Chiang, HyoJin Rosaria. "Examination of mammalian microRNAs by high-throughput sequencing." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/65289.

Full text

Abstract:

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Biology, 2011.
Cataloged from PDF version of thesis.
Includes bibliographical references.
Small non-coding RNAs play an important role in a wide range of cellular events. MicroRNAs (miRNAs) are an abundant class of small RNAs that post-transcriptionally repress expression of their target genes. Since miRNA targeting is based on its sequence, accurate and comprehensive annotation of miRNA genes is fundamental to understanding miRNA gene regulation. Advances in high-throughput sequencing technology have led to the discovery of novel small RNA genes and the identification of their properties. We describe a method for constructing small-RNA libraries for the Illumina sequencing platform that improves upon previous efforts. Sequencing data from small-RNA libraries constructed using this protocol can be used to profile small RNAs from a broad range of samples. In particular, we sequenced 60 million small RNAs from mouse brain, ovary, testes, embryonic stem cells, three embryonic stages, and whole newborns. Analysis of the data provides a substantially revised list of confidently identified murine miRNAs, thereby providing a more accurate picture of the general features of mammalian miRNAs and their abundance in the genome. In addition, our results revealed new aspects of miRNA biogenesis and modification, including tissue-specific strand preferences, sequential Dicer cleavage of a metazoan pre-miRNA, cases of consequential 5' heterogeneity, newly identified instances of miRNA editing, and widespread pre-miRNA uridylation reminiscent of Lin28-like miRNA regulation.
by HyoJin Rosaria Chiang.
Ph.D.

APA, Harvard, Vancouver, ISO, and other styles

14

Stromberg, Michael Peter. "Enabling high-throughput sequencing data analysis with MOSAIK." Thesis, Boston College, 2010. http://hdl.handle.net/2345/1332.

Full text

Abstract:

Thesis advisor: Gabor T. Marth
During the last few years, numerous new sequencing technologies have emerged that require tools that can process large amounts of read data quickly and accurately. Regardless of the downstream methods used, reference-guided aligners are at the heart of all next-generation analysis studies. I have developed a general reference-guided aligner, MOSAIK, to support all current sequencing technologies (Roche 454, Illumina, Applied Biosystems SOLiD, Helicos, and Sanger capillary). The calibrated alignment qualities calculated by MOSAIK allow the user to fine-tune the alignment accuracy for a given study. MOSAIK is a highly configurable and easy-to-use suite of alignment tools that is used in hundreds of labs worldwide. MOSAIK is an integral part of our genetic variant discovery pipeline. From SNP and short-INDEL discovery to structural variation discovery, alignment accuracy is an essential requirement and enables our downstream analyses to provide accurate calls. In this thesis, I present three major studies that were formative during the development of MOSAIK and our analysis pipeline. In addition, I present a novel algorithm that identifies mobile element insertions (non-LTR retrotransposons) in the human genome using split-read alignments in MOSAIK. This algorithm has a low false discovery rate (4.4%) and enabled our group to be the first to determine the number of mobile elements that differentially occur between any two individuals.
Thesis (PhD) — Boston College, 2010
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology

APA, Harvard, Vancouver, ISO, and other styles

15

Xing, Zhengrong. "Poisson multiscale methods for high-throughput sequencing data." Thesis, The University of Chicago, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10195268.

Full text

Abstract:

In this dissertation, we focus on the problem of analyzing data from high-throughput sequencing experiments. With the emergence of more capable hardware and more efficient software, these sequencing data provide information at an unprecedented resolution. However, statistical methods developed for such data rarely tackle the data at such high resolutions, and often make approximations that only hold under certain conditions.

We propose a model-based approach to dealing with such data, starting from a single sample. By taking into account the inherent structure present in such data, our model can accurately capture important genomic regions. We also present the model in such a way that makes it easily extensible to more complicated and biologically interesting scenarios.

Building upon the single-sample model, we then turn to the statistical question of detecting differences between multiple samples. Such questions often arise in the context of expression data, where much emphasis has been put on the problem of detecting differential expression between two groups. By extending the framework for a single sample to incorporate additional group covariates, our model provides a systematic approach to estimating and testing for such differences. We then apply our method to several empirical datasets, and discuss the potential for further applications to other biological tasks.

We also seek to address a different statistical question, where the goal here is to perform exploratory analysis to uncover hidden structure within the data. We incorporate the single-sample framework into a commonly used clustering scheme, and show that our enhanced clustering approach is superior to the original clustering approach in many ways. We then apply our clustering method to a few empirical datasets and discuss our findings.

Finally, we apply the shrinkage procedure used within the single-sample model to tackle a completely different statistical issue: nonparametric regression with heteroskedastic Gaussian noise. We propose an algorithm that accurately recovers both the mean and variance functions given a single set of observations, and demonstrate its advantages over state-of-the-art methods through extensive simulation studies.

APA, Harvard, Vancouver, ISO, and other styles

16

de, Lange Katrina Melanie. "Understanding inflammatory bowel disease using high-throughput sequencing." Thesis, University of Cambridge, 2017. https://www.repository.cam.ac.uk/handle/1810/265370.

Full text

Abstract:

For over two decades, the study of genetics has been making significant progress towards understanding the causes of common disease. Across a wide range of complex disorders there have been hundreds of associated loci identified, largely driven by common genetic variation. Now, with the advent of next-generation sequencing technology, we are able to interrogate rare and low frequency variation in a high throughput manner for the first time. This provides an exciting opportunity to investigate the role of rarer variation in complex disease risk on a genome-wide scale, potentially offering novel insights into the biological mechanisms underlying disease pathogenesis. In this thesis I will assess the potential of this technology to further our understanding of the genetics of complex disease, using inflammatory bowel disease (IBD) as an example. After first reviewing the history of genetic studies into IBD, I will describe the analytical challenges that can occur when using sequencing to perform case-control association testing at scale, and the methods that can be used to overcome these. I then test for novel IBD associations in a low coverage whole genome sequencing dataset, and uncover a significant burden of rare, damaging missense variation in the gene NOD2, as well as a more general burden of such variation amongst known inflammatory bowel disease risk genes. Through imputation into both new and existing genotyped cohorts, I also describe the discovery of 26 novel IBD-associated loci, including a low frequency missense variant in ADCY7 that approximately doubles the risk of ulcerative colitis. I resolve biological associations underlying several of these novel associations, including a number of signals associated with monocyte-specific changes in integrin gene expression following immune stimulation.
These results reveal important insights into the genetic architecture of inflammatory bowel disease, and suggest that a combination of continued array-based genome-wide association studies, imputed using substantial new reference panels, and large scale deep sequencing projects will be required in order to fully understand the genetic basis of complex diseases like IBD.

APA, Harvard, Vancouver, ISO, and other styles

17

Schwartz, Jerrod Joseph. "Technologies for high throughput single molecule DNA sequencing /." May be available electronically:, 2009. http://proquest.umi.com/login?COPT=REJTPTU1MTUmSU5UPTAmVkVSPTI=&clientId=12498.

Full text

APA, Harvard, Vancouver, ISO, and other styles

18

Siragusa, Enrico [Verfasser]. "Approximate string matching for high-throughput sequencing / Enrico Siragusa." Berlin : Freie Universität Berlin, 2015. http://d-nb.info/1074404882/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

19

Keebler, Jonathan Edward Myers. "Spontaneous Mutation Discovery via High-Throughput Sequencing of Pedigrees." NCSU, 2010. http://www.lib.ncsu.edu/theses/available/etd-03312010-151914/.

Full text

Abstract:

Recent technological advances have made high-throughput DNA sequencing a routine laboratory experiment. This progression in technology has been made possible by the parallel production of millions of short fragments of sequence. The responsibility of garnering biological information from these DNA fragments has shifted from the wet-lab to the bioinformatician. As sequencing technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes, a task that is not necessarily trivial using high-throughput sequencing reads. A violation of Mendelian inheritance laws observed amid the resequenced genomes of family members can indicate the presence of a de novo mutation. A method for locating de novo mutations by probabilistically inferring genotypes across a pedigree using high-throughput sequencing is presented and applied to two resequenced nuclear families: one as a collaborative effort within The 1,000 Genomes Project, and the second in an attempt to discover candidate driver and passenger mutations within the genome of an Acute Lymphoblastic Leukemia. The mutation findings within these projects are presented, and the approach is examined in detail, highlighting areas where method improvements may be made. Considering the challenges experienced in these studies within the larger context of the nascent field of Personal Genomics, an honest assessment is presented of developments that must be made before the application of whole-genome sequencing on the scale of an individual human can unequivocally be used to predict, diagnose, or treat human disease.
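The notion of a Mendelian inheritance violation in a parent-child trio can be illustrated with a small sketch. Note the simplification: the thesis infers genotypes probabilistically across the pedigree, whereas the code below assumes hard genotype calls are already available and merely flags sites where the child's genotype cannot be formed from one maternal and one paternal allele. The function names and the input layout are hypothetical.

```python
from itertools import product


def mendelian_consistent(child, mother, father):
    """Return True if the child's diploid genotype can be formed by
    taking one allele from the mother and one from the father.

    Genotypes are 2-tuples of alleles, e.g. ("A", "G").
    """
    target = sorted(child)
    return any(sorted((m, f)) == target for m, f in product(mother, father))


def candidate_de_novo_sites(trio_calls):
    """trio_calls maps a site label to (child, mother, father) genotypes.

    Returns the sites that violate Mendelian inheritance. With real data
    these are only candidates, since sequencing or genotyping errors
    produce the same pattern.
    """
    return [
        site
        for site, (child, mother, father) in trio_calls.items()
        if not mendelian_consistent(child, mother, father)
    ]


calls = {
    "site1": (("A", "G"), ("A", "A"), ("G", "G")),  # consistent
    "site2": (("A", "G"), ("A", "A"), ("A", "A")),  # G absent in both parents
    "site3": (("C", "C"), ("C", "T"), ("C", "T")),  # consistent
}
flagged = candidate_de_novo_sites(calls)
```

A hard-call filter like this is sensitive to genotyping errors, which is one reason a probabilistic treatment of genotypes across the pedigree, as described in the abstract, is attractive.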

APA, Harvard, Vancouver, ISO, and other styles

20

Weese, David [Verfasser]. "Indices and Applications in High-Throughput Sequencing / David Weese." Berlin : Freie Universität Berlin, 2013. http://d-nb.info/1036130150/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

21

Person, Kerry P. (Kerry Patrick). "Operational streamlining in a high-throughput genome sequencing center." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/37248.

Full text

Abstract:

Thesis (M.B.A.)--Massachusetts Institute of Technology, Sloan School of Management; and, (S.M.)--Massachusetts Institute of Technology, Dept. of Chemical Engineering; in conjunction with the Leaders for Manufacturing Program at MIT, 2006.
Includes bibliographical references (p. 83-84).
Advances in medicine rely on accurate data that is rapidly provided. It is therefore critical for the Genome Sequencing platform of the Broad Institute of MIT and Harvard to continually strive to reduce cost, improve throughput, and increase the quality of its data output. In the past, new technology in the form of both chemistry improvements and robotics has allowed the Institute to achieve these goals in a step-wise manner. However, as the rate of technology progression in sequencing has slowed, the Institute has been forced to look to continuous, incremental improvement in order to achieve its goals. The Core Sequencing/Detection group handles the high-throughput sequencing duties at the Broad Institute. Through the use of robotics and cutting edge biology, they are able to process and sequence upwards of 50 billion bases of DNA per year. The work that this thesis was based on took place primarily in this automated production area. This thesis utilizes a number of lean concepts, including the 7 Wastes and pull production control.
Kanban systems, workflow changes, and a 5S implementation were used to bring these concepts to life at the Broad Institute. In order to correctly size the kanban system, process buildup diagrams and discrete event simulation were used. Each of these tools helped to drive the process towards the Institute's goals of reducing cost and improving quality and throughput.
by Kerry P. Person.
S.M.
M.B.A.

APA, Harvard, Vancouver, ISO, and other styles

22

Fritz, Markus Hsi-Yang. "Exploiting high throughput DNA sequencing data for genomic analysis." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610819.

Full text

APA, Harvard, Vancouver, ISO, and other styles

23

Woolford, Julie Ruth. "Statistical analysis of small RNA high-throughput sequencing data." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610375.

Full text

APA, Harvard, Vancouver, ISO, and other styles

24

Pérez, Cantalapiedra Carlos. "Accessing genetic variability in Spanish barleys through high-throughput sequencing." Doctoral thesis, Universitat Autònoma de Barcelona, 2016. http://hdl.handle.net/10803/399850.

Full text

Abstract:

Barley is an important crop in the Mediterranean region, which is characterized by scarce and irregular rainfall. In the Iberian Peninsula, it has been cultivated for thousands of years, leading to specific adaptations to prevalent biotic and abiotic stresses. These features, present in Spanish barley landraces, remain to be exploited in breeding. High-throughput sequencing (HTS) has revolutionized plant research. It has made it possible to sequence the genomes of multiple organisms. The sequence-enriched physical map of barley was published in late 2012. A first step towards exploiting barley genomics for practical purposes was to facilitate access to the barley physical map for geneticists and breeders. This was the aim that led us to develop Barleymap, a software tool which allows locating genetic markers in the barley physical-genetic map. This application effectively integrates and maps markers from different widely used barley genotyping platforms and, in general, any marker with sequence information. Another advantage of HTS is that diverse experimental setups can be used with different research objectives. Here, we used exome sequencing to fine-map a powdery mildew resistance QTL from a Spanish barley landrace. Exploiting a large mapping population, we were able to narrow down the position of the QTL to a single physical contig. Moreover, we could identify, and partially assemble, an expressed candidate gene. To achieve this, an array of bioinformatics approaches was applied to differentiate presence-absence variation within a cluster of closely related genes of the NBS-LRR family. Another powerful application of HTS is RNAseq, which allows sequencing whole transcriptomes and performing gene expression assays with unprecedented power. We de novo assembled the transcriptomes of a drought-susceptible elite barley cultivar and a drought-resistant Spanish barley landrace.
Then, we compared the expression changes in leaves and developing inflorescences from both genotypes under drought treatments. This revealed large differences in their responses to stress. A comparison with other drought gene expression studies on barley, and an analysis of the transcription factors and cis-regulatory elements involved, provided new insights into the complex barley gene expression network under stress. In summary, HTS has brought many new possibilities to plant research. To take full advantage of it, crosstalk between bioinformatics and genetics must be fostered to adapt the new genomic resources to specific needs.

APA, Harvard, Vancouver, ISO, and other styles

25

Kircher, Martin. "Understanding and improving high-throughput sequencing data production and analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-71102.

Full text

Abstract:

Advances in DNA sequencing revolutionized the field of genomics over the last 5 years. New sequencing instruments make it possible to rapidly generate large amounts of sequence data at substantially lower cost. These high-throughput sequencing technologies (e.g. Roche 454 FLX, Life Technology SOLiD, Dover Polonator, Helicos HeliScope and Illumina Genome Analyzer) make whole genome sequencing and resequencing, transcript sequencing as well as quantification of gene expression, DNA-protein interactions and DNA methylation feasible at an unanticipated scale. In the field of evolutionary genomics, high-throughput sequencing permitted studies of whole genomes from ancient specimens of different hominin groups. Further, it allowed large-scale population genetics studies of present-day humans as well as different types of sequence-based comparative genomics studies in primates. Such comparisons of humans with closely related apes and hominins are important not only to better understand human origins and the biological background of what sets humans apart from other organisms, but also for understanding the molecular basis for diseases and disorders, particularly those that affect uniquely human traits, such as speech disorders, autism or schizophrenia. However, while the cost and time required to create comparative data sets have been greatly reduced, the error profiles and limitations of the new platforms differ significantly from those of previous approaches. This requires a specific experimental design in order to circumvent these issues, or to handle them during data analysis. During the course of my PhD, I analyzed and improved current protocols and algorithms for next generation sequencing data, taking into account the specific characteristics of these new sequencing technologies. 
The presented approaches and algorithms were applied in different projects and are widely used within the department of Evolutionary Genetics at the Max Planck Institute of Evolutionary Anthropology. In this thesis, I will present selected analyses from the whole genome shotgun sequencing of two ancient hominins and the quantification of gene expression from short-sequence tags in five tissues from three primates.


26

Anandhakumar, Chandran. "Advancing Synthetic Gene Regulators Development with High-Throughput Sequencing Technologies." 京都大学 (Kyoto University), 2015. http://hdl.handle.net/2433/202663.



27

Mohamadi, Hamid. "Parallel algorithms and software tools for high-throughput sequencing data." Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/62072.


Abstract:

With the growing throughput and dropping cost of High-Throughput Sequencing (HTS) technologies, there is a continued need to develop faster and more cost-effective bioinformatics solutions. However, the algorithms and computational power required to efficiently analyze HTS data have lagged considerably. In health and life sciences research organizations, de novo assembly and sequence alignment have become two key steps in everyday research and analysis. The de novo assembly process is a fundamental step in analyzing previously uncharacterized organisms and is one of the most computationally demanding problems in bioinformatics. Sequence alignment is a fundamental operation in a broad spectrum of genomics projects. In genome resequencing projects, alignments are often used prior to variant calling. In transcriptome resequencing, they provide information on gene expression. They are even used in de novo sequencing projects to help contiguate assembled sequences. As such, designing efficient, scalable, and accurate solutions for de novo assembly and sequence alignment problems would have a wide effect in the field. In this thesis, I present a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures. I also utilize the latest advances in parallel and distributed computing to design and develop scalable and cost-effective algorithms on High-Performance Computing (HPC) infrastructures, especially for the de novo assembly and sequence alignment problems. The algorithms and software solutions I develop are publicly available for free for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide.
Science, Faculty of
Graduate


28

Stokowy, Tomasz, Markus Eszlinger, Michał Świerniak, Krzysztof Fujarewicz, Barbara Jarząb, Ralf Paschke, and Kurt Krohn. "Analysis options for high-throughput sequencing in miRNA expression profiling." Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-144393.


Abstract:

Background: Recently, high-throughput sequencing (HTS) using next generation sequencing techniques became useful in digital gene expression profiling. Our study introduces analysis options for HTS data based on mapping to miRBase or counting and grouping of identical sequence reads. Those approaches allow a hypothesis-free detection of miRNA differential expression. Methods: We compare our results to microarray and qPCR data from one set of RNA samples. We use Illumina platforms for microarray analysis and miRNA sequencing of 20 samples from benign follicular thyroid adenoma and malignant follicular thyroid carcinoma. Furthermore, we use three strategies for HTS data analysis to evaluate miRNA biomarkers for malignant versus benign follicular thyroid tumors. Results: High correlation of qPCR and HTS data was observed for the proposed analysis methods. However, qPCR is limited in the differential detection of miRNA isoforms. Moreover, we illustrate a much broader dynamic range of HTS compared to microarrays for small RNA studies. Finally, our data confirm hsa-miR-197-3p, hsa-miR-221-3p, hsa-miR-222-3p and both hsa-miR-144-3p and hsa-miR-144-5p as potential follicular thyroid cancer biomarkers. Conclusions: Compared to microarrays, HTS provides a global profile of miRNA expression with higher specificity and in more detail. Summarizing HTS reads as isoform groups (analysis pipeline B) or according to functional criteria (seed analysis pipeline C), which correlates better with the qPCR results, is a promising new option for HTS analysis. Finally, our data open future miRNA research perspectives for HTS and indicate that qPCR might be limited in validating HTS data in detail.
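The "counting and grouping of identical sequence reads" described in this abstract can be pictured with a minimal sketch. It is only an illustration, not the authors' pipeline: the function names and toy reads are hypothetical, and the seed is assumed here to be nucleotides 2-8 of each read, a common convention that the abstract itself does not state.

```python
from collections import Counter

def count_identical(reads):
    """Collapse identical read sequences into (sequence -> count)."""
    return Counter(reads)

def group_by_seed(read_counts, start=1, end=8):
    """Sum read counts per seed; the seed is assumed to be nucleotides 2-8."""
    seeds = Counter()
    for seq, n in read_counts.items():
        seeds[seq[start:end]] += n
    return seeds

# Toy reads (hypothetical): two identical reads, one length variant, one unrelated.
reads = [
    "UAGCUUAUCAGACUG",
    "UAGCUUAUCAGACUG",
    "UAGCUUAUCAGACUGA",
    "UGGAAUGUAAAGAAG",
]
counts = count_identical(reads)
seed_counts = group_by_seed(counts)
```

Grouping by seed merges length variants of the same sequence into one count, which is the kind of summarization the abstract contrasts with per-sequence counting.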


29

Ainsworth, David. "Computational approaches for metagenomic analysis of high-throughput sequencing data." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/44070.


Abstract:

High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. This ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented insights into diverse microbial communities. These technologies, however, are not without their limitations: the short length of the reads requires the production of vast amounts of data to ensure all information is captured. This 'data deluge' has been a major bottleneck and has necessitated the development of new algorithms for analysis. Sequence alignment methods provide the most information about the composition of a sample, as they allow both taxonomic and functional classification, but the algorithms are prohibitively slow. This inefficiency has led to the reliance on faster algorithms which only produce simple taxonomic classification or abundance estimation, losing the valuable information given by full alignments against annotated genomes. This thesis will describe k-SLAM, a novel ultra-fast method for the alignment and taxonomic classification of metagenomic data. Using a k-mer based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment-based approaches, making full taxonomic classification and gene identification tractable on modern large datasets. The alignments found by k-SLAM can also be used to find variants and identify genes, along with their nearest taxonomic origins. A novel pseudo-assembly method produces more specific taxonomic classifications on species which have high sequence identity within their genus. This provides a significant (up to 40%) increase in accuracy on these species. Also described is a re-analysis of a Shiga-toxin producing E. coli O104:H4 isolate via alignment against bacterial and viral species to find antibiotic resistance and toxin producing genes.
k-SLAM has been used by a range of research projects including FLORINASH and is currently being used by a number of groups.
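To make the k-mer idea concrete, here is a minimal sketch of k-mer based read classification. It is not k-SLAM's algorithm, whose details the abstract does not give; the function names, the toy reference sequences and the "most shared k-mers wins" rule are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def classify(read, index, k):
    """Return the reference sharing the most k-mers with the read, or None."""
    hits = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]

# Toy references (hypothetical taxa and sequences).
references = {"taxon_a": "ACGTACGTGGCA", "taxon_b": "TTGACCATGCAA"}
index = build_index(references, k=4)
```

A hash lookup per k-mer avoids comparing each read against every reference base by base, which is the general reason k-mer methods are fast; a real tool would still need to handle sequencing errors, reverse complements and ties.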


30

Wan, Ji. "Global analysis of alternative polyadenylation regulation using high-throughput sequencing." Diss., University of Iowa, 2012. https://ir.uiowa.edu/etd/3548.


Abstract:

Messenger RNAs (mRNAs) have to undergo a series of post-transcriptional processing steps before translation. One of these steps, 3' end processing, which consists of cleavage and polyadenylation, is critical for delimiting the 3' end of mRNA and determining regulatory elements for downstream post-transcriptional/translational regulation. Like splicing, another well-characterized mRNA processing step, 3' end processing is very flexible due to the diversity of trans-acting factors and cis-acting elements in the 3' end of mRNA. In recent years, the differential usage of alternative polyA sites (APA) of the same gene, which leads to mRNA isoforms with different 3' UTRs, has been increasingly revealed by both experimental and computational studies. More significantly, global changes of 3' UTR length have been observed in multiple clinical settings, particularly in cancer cells. However, the description of the APA phenomenon has not been matched by efforts to study the mechanism underlying APA biogenesis. In this thesis, we first describe the general principle and pipeline to identify APA in different biological or clinical conditions using various high throughput sequencing techniques. After that, we present work on the global impacts of two RNA binding proteins (ESRP/aCP) and one core 3' end processing factor (CstF64 and its paralog CstF64τ) on the regulation of APA. The APA identification analyses and motif analyses suggest a wide range of APA associated with the expression change of those proteins in different cell lines. In addition, for each protein, we have collected substantial evidence about the mechanism underlying the APA induction. Our findings could provide significant insights into the APA regulation mechanisms. In addition, we also studied the induction of APA in JEG-3 cells as a response to the change of oxygen supply (hypoxia and normoxia).
Using a robust protocol for specifically sequencing the 3' end of mRNA, we identified more than 500 APA events and revealed a global shortening pattern of 3' UTR length as a result of hypoxia. The work on APA in this thesis considerably increases the understanding of APA regulation by various proteins and provides new evidence for APA in clinical conditions.


31

Mammana, Alessandro [Verfasser]. "Patterns and algorithms in high-throughput sequencing count data / Alessandro Mammana." Berlin : Freie Universität Berlin, 2016. http://d-nb.info/1108270956/34.



32

Love, Michael I. [Verfasser]. "Statistical analysis of high-throughput sequencing count data / Michael I. Love." Berlin : Freie Universität Berlin, 2013. http://d-nb.info/1043197842/34.



33

Wignall-Fleming, Elizabeth Bowie. "Investigations into the dynamics of paramyxovirus infections by high-throughput sequencing." Thesis, University of Glasgow, 2019. http://theses.gla.ac.uk/40905/.


Abstract:

The paramyxovirus family can cause a broad spectrum of diseases, from mild febrile illnesses to more severe diseases that may require hospitalisation and can in the most serious cases have fatal outcomes. Understanding the virus infection dynamics is fundamental to the development of novel targets for therapeutic and vaccine development. The advancement of high-throughput sequencing (HTS) has revolutionised biomedical research, providing unparalleled opportunities to answer complex questions. In this study we developed a workflow using directional analysis of HTS data to simultaneously analyse the kinetics of virus transcription and replication for PIV5 strain W3, PIV2, MuV and PIV3. The workflow could be used for the study of all negative strand viruses. The developed workflow was used to investigate a number of characteristics of paramyxovirus transcription, including quantification of the transcription gradient, RNA editing resulting in the generation of non-templated mRNAs, and the production of read-through mRNAs. Interestingly, the processivity of the RNA polymerase during transcription was shown to remain consistent throughout the infection amongst all of the viruses analysed. Additionally, virus replication and the generation of antigenomes were found to occur at early times post infection. This was surprising, as the current model for virus replication requires sufficient levels of NP to be present in the cytoplasm before the virus can enter replicative mode. These results suggest a revision of this model in which the virus produces local sites of virus transcription and replication in the cytoplasm known as foci, and it is the level of NP surrounding the virus genomes at these local sites that dictates the virus's ability to enter a replicative mode. PIV5 strain W3 was shown to suppress virus gene expression at late times post infection, resulting in the establishment of a persistent infection.
The developed workflow was used to analyse the infection dynamics of PIV5. There were no changes in the RNA polymerase processivity of transcription that could account for the suppression of protein synthesis. A comparative analysis of PIV5 strains W3 and CPI+ identified a serine-to-phenylalanine mutation at position 157 of the P protein in CPI+ that abolished the virus's ability to suppress protein synthesis and establish a persistent infection. This position is a phosphorylation site that, when phosphorylated by polo-like kinase 1 (PLK-1), was previously shown to play a role in the inhibition of virus RNA synthesis. Together, these findings indicate that phosphorylation of serine at position 157 is responsible for the inhibition of virus gene expression and the establishment of persistence.


34

Ballinger, Tracy J. "Analysis of genomic rearrangements in cancer from high throughput sequencing data." Thesis, University of California, Santa Cruz, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3729995.


Abstract:

In the last century cancer has become increasingly prevalent and is the second largest killer in the United States, estimated to afflict 1 in 4 people during their life. Despite our long history with cancer and our herculean efforts to thwart the disease, in many cases we still do not understand the underlying causes or have successful treatments. In my graduate work, I’ve developed two approaches to the study of cancer genomics and applied them to the whole genome sequencing data of cancer patients from The Cancer Genome Atlas (TCGA). In collaboration with Dr. Ewing, I built a pipeline to detect retrotransposon insertions from paired-end high-throughput sequencing data and found somatic retrotransposon insertions in a fifth of cancer patients.

My second novel contribution to the study of cancer genomics is the development of the CN-AVG pipeline, a method for reconstructing the evolutionary history of a single tumor by predicting the order of structural mutations such as deletions, duplications, and inversions. The CN-AVG theory was developed by Drs. Haussler, Zerbino, and Paten and samples potential evolutionary histories for a tumor using Markov Chain Monte Carlo sampling. I contributed to the development of this method by testing its accuracy and limitations on simulated evolutionary histories. I found that the ability to reconstruct a history decays exponentially with increased breakpoint reuse, but that we can estimate how accurately we reconstruct a mutation event using the likelihood scores of the events. I further designed novel techniques for the application of CN-AVG to whole genome sequencing data from actual patients and applied these techniques to search for evolutionary patterns in glioblastoma multiforme using sequencing data from TCGA. My results show patterns of two-hit deletions, as we would expect, and amplifications occurring over several mutational events. I also find that the CN-AVG method frequently makes use of whole chromosome copy number changes followed by localized deletions, a bias that could be mitigated by modifying the cost function for an evolutionary history.


35

Sibthorp, Christopher. "Analysis of the Aspergillus nidulans transcriptome using high-throughput RNA sequencing." Thesis, University of Liverpool, 2012. http://livrepository.liverpool.ac.uk/9973/.


Abstract:

The filamentous fungus Aspergillus nidulans is a well-characterized model organism which has been used extensively for the study of eukaryotic cell biology and genetics over the past 60 years. The A. nidulans genome was sequenced in 2005, and various genome annotations have been released since, the majority of which rely heavily on in silico gene prediction. The development of high-throughput next generation sequencing technologies has revolutionised transcriptomics by allowing RNA-analysis of whole transcriptomes through massively parallel cDNA sequencing (RNA-seq). This sequencing approach has been applied to the A. nidulans transcriptome, and augmented by the development of a novel strategy for selectively sequencing the 5′ ends of RNAs on the ABI SOLiD platform. This aimed to produce a more robust resource for gene interrogation and the investigation of regulatory elements which impact on the transcriptomic landscape in A. nidulans. Bioinformatic analysis of RNA-seq data was used to define 15,375 transcription start site (TSS) regions, which have been characterised by statistical analysis of mapped 5′ end distribution. Motif finding within sequence regions surrounding these TSS identified 16 putative functional promoter motifs based on overrepresentation and distributional analysis within promoters, and GO annotation found significant functional enrichment amongst genes associated with two of these motifs (AARARAAA and TTTYTTY). Transcript assembly of RNA-seq data has also revealed 16,065 putative transcripts, 1,112 of which were mapped to regions annotated as intergenic. From these transcripts we identified 38 strong candidates for novel protein coding genes (six of which contained non-canonical translation start sites), and over 400 additional transcripts containing putative coding regions.
Separation of RNA-seq data into two sets of strand-specific reads was shown to greatly increase the quality of transcript assembly and facilitated the identification of 2,291 occurrences of sense:antisense overlap between assembled transcripts, four of which have been proven experimentally. Finally, assembled transcripts have been used to detect multiple transcript isoforms arising from alternative splicing events. 374 distinct loci were identified as the origins of alternatively spliced transcripts, and six of these were verified experimentally.


36

Glaus, Peter. "Bayesian methods for gene expression analysis from high-throughput sequencing data." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/bayesian-methods-for-gene-expression-analysis-from-highthroughput-sequencing-data(cf9680e0-a3f2-4090-8535-a39f3ef50cc4).html.


Abstract:

We study the tasks of transcript expression quantification and differential expression analysis based on data from high-throughput sequencing of the transcriptome (RNA-seq). In an RNA-seq experiment subsequences of nucleotides are sampled from a transcriptome specimen, producing millions of short reads. The reads can be mapped to a reference to determine the set of transcripts from which they were sequenced. We can measure the expression of transcripts in the specimen by determining the amount of reads that were sequenced from individual transcripts. In this thesis we propose a new probabilistic method for inferring the expression of transcripts from RNA-seq data. We use a generative model of the data that can account for read errors, fragment length distribution and non-uniform distribution of reads along transcripts. We apply the Bayesian inference approach, using the Gibbs sampling algorithm to sample from the posterior distribution of transcript expression. Producing the full distribution enables assessment of the uncertainty of the estimated expression levels. We also investigate the use of alternative inference techniques for the transcript expression quantification. We apply a collapsed Variational Bayes algorithm which can provide accurate estimates of mean expression faster than the Gibbs sampling algorithm. Building on the results from transcript expression quantification, we present a new method for the differential expression analysis. Our approach utilizes the full posterior distribution of expression from multiple replicates in order to detect significant changes in abundance between different conditions. The method can be applied to differential expression analysis of both genes and transcripts. We use the newly proposed methods to analyse real RNA-seq data and provide evaluation of their accuracy using synthetic datasets. 
We demonstrate the advantages of our approach in comparisons with existing alternative approaches for expression quantification and differential expression analysis. The methods are implemented in the BitSeq package, which is freely distributed under an open-source license. Our methods can be accessed and used by other researchers for RNA-seq data analysis.
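The idea of sampling transcript expression from a posterior with Gibbs sampling can be sketched on a toy model. This is not the BitSeq implementation: it assumes a flat Dirichlet prior, treats every compatible transcript as equally likely for a read, and ignores read errors, fragment lengths and positional bias, all of which the thesis model accounts for. The function and variable names are hypothetical.

```python
import random

def gibbs_expression(read_compat, transcripts, iters=2000, burn_in=500, seed=0):
    """Posterior mean of relative expression under a toy mixture model.

    read_compat: one list per read, naming the transcripts it maps to.
    """
    rng = random.Random(seed)
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    totals = {t: 0.0 for t in transcripts}
    for it in range(iters):
        # Step 1: sample an origin transcript for each read given theta.
        counts = {t: 0 for t in transcripts}
        for compat in read_compat:
            weights = [theta[t] for t in compat]
            chosen = rng.choices(compat, weights=weights)[0]
            counts[chosen] += 1
        # Step 2: sample theta from Dirichlet(1 + counts) via gamma draws.
        draws = {t: rng.gammavariate(1.0 + counts[t], 1.0) for t in transcripts}
        norm = sum(draws.values())
        theta = {t: draws[t] / norm for t in transcripts}
        if it >= burn_in:
            for t in transcripts:
                totals[t] += theta[t]
    kept = iters - burn_in
    return {t: totals[t] / kept for t in transcripts}

# Toy data: 8 reads unique to A, 2 unique to B, 5 compatible with both.
reads = [["A"]] * 8 + [["B"]] * 2 + [["A", "B"]] * 5
estimate = gibbs_expression(reads, ["A", "B"])
```

Keeping the individual samples rather than only their mean would give the full posterior, which is what allows the uncertainty of each expression estimate to be assessed.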


37

Solayman, Md. "High-Throughput Sequencing Based Probing of Protein/RNA Structures and Functions." Thesis, Griffith University, 2022. http://hdl.handle.net/10072/416290.


Abstract:

The rapid advancement in sequencing chemistry, sequencing technologies, and bioinformatics has significantly increased sequencing automation and lowered the cost. The applications of high-throughput sequencing (HTS) technologies are expanding from research laboratories to diagnostic clinics on a regular basis. Moreover, diverse methods used in epigenetics, proteomics, and structure probing of macromolecules (DNA, RNA, and proteins) have been developed based on HTS technology. This thesis describes the development of two novel techniques, high-throughput split-protein profiling (HiTS) and an RNA solvent accessibility probing method (RL-Seq), broadening the applications of HTS technologies for probing protein/RNA structures and functions. Chapter 1 of the thesis provides an overview of the history of HTS technologies, available platforms, ongoing development in this field, and their diverse applications, particularly in the area of proteomics and RNA structure probing. In Chapter 2, we introduced the HiTS method that allowed fast identification of self- and assisted complementary positions of three antibiotic-resistant proteins (fosfomycin, fosA3; erythromycin, ermB; and chloramphenicol, catI resistant-proteins). The finding of suitable split sites in proteins is important because they are used as reporters in protein complementary assay (PCA) for studying protein-protein interactions in different organisms. However, only a small number of split-protein systems have been identified so far owing to manual, labour-intensive optimization of the candidate genes. The proposed HiTS method employs transposon mutagenesis, conditional interaction of split fragments by rapamycin-regulated FRB-FKBP protein pairs, and deep sequencing for fast identification of self- and assisted complementary fragments, which are subsequently confirmed by low-throughput testing.
In Chapter 3, we further applied the HiTS method to T7 RNA polymerase (T7 RNAP), a bacteriophage RNA polymerase, considering its importance in synthetic biology in addition to the PCA. We found that the newly developed HiTS method could also be applied to T7 RNAP for locating suitable split sites for self-complementing variants. Several self-complementing variants were found, including one with a stronger signal than the wild type. In Chapter 4, in preparation for applying HTS technology to probe RNA solvent accessibility, we reviewed the available experimental and computational techniques for RNA solvent accessibility studies and identified existing research gaps. Current experimental approaches for studying RNA solvent accessibility include hydroxyl radical probing (HRF-Seq), light activated structural examination of RNA (LASER), and its modified versions (LASER-Seq, LASER-Map, and icLASER). The reactivity readouts of these methods are based on either the reverse transcriptase stop (RT-stop) at cleavage points or mutational profiling at adduct formation sites. These approaches rely on reverse transcriptase enzymes and random primers, which suffer from non-specific drop-off that creates short truncated sequences, which in turn lead to false-positive signals at probe-reactive sites. In Chapter 5, we proposed the RL-Seq (RtcB Ligation-Seq) method to overcome the above-mentioned limitations of the existing approaches. The method is illustrated by measuring the solvent accessibility of Escherichia coli complete ribosomal complexes at single-nucleotide resolution. In this method, unique properties of RtcB ligase were used to identify the probing sites by ligating a pre-defined linker containing a 5′-OH end with the 3′-P ends generated by hydroxyl radical cleavage.
The application of this method to ribosomal RNAs (23S, 16S, and 5S rRNAs) confirmed its ability to estimate solvent accessibility with high sensitivity (requiring low sequencing depth) and accuracy (strong correlation to structure-derived values). In addition, the pre-defined linker employed in this method allowed the use of a fixed primer in the reverse transcription reaction and significantly minimized the biases during subsequent PCR amplification. In Chapter 6, we discussed the future prospects of the HTS technology-based methods developed in this thesis.
Thesis (PhD Doctorate)
Doctor of Philosophy (PhD)
Institute for Glycomics


38

Paicu, Claudia. "miRNA detection and analysis from high-throughput small RNA sequencing data." Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/63738/.


Abstract:

Small RNAs (sRNAs) are a broad class of short regulatory non-coding RNAs. microRNAs (miRNAs) are a special class of ~21-22 nucleotide sRNAs which are derived from a stable hairpin-like secondary structure. miRNAs have critical gene regulatory functions and are involved in many pathways including developmental timing, organogenesis and development in both plants and animals. Next generation sequencing (NGS) technologies, which are often used for identifying miRNAs, are continuously evolving, generating datasets containing millions of sRNAs, which has led to new challenges for the tools used to predict miRNAs from such data. There are several tools for miRNA detection from NGS datasets, which we review in this thesis, identifying a number of potential shortcomings in their algorithms. In this thesis, we present a novel miRNA prediction algorithm, miRCat2. Our algorithm is more robust to variations in sequencing depth because it compares aligned sRNA reads to a random uniform distribution to detect peaks in the input dataset, using a new entropy-based approach. It then applies filters based on miRNA biogenesis to the read alignment and to the computed secondary structure. Results show that miRCat2 has a better specificity-sensitivity trade-off than similar tools, and its predictions also contain a larger percentage of sequences that are downregulated in mutants in the miRNA biogenesis pathway. This confirms the validity of novel predictions, which may lead to new miRNA annotations, expanding and contributing to the field of sRNA research.
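One way to picture comparing read coverage against a uniform distribution is the Kullback-Leibler divergence sketched below. This is an illustrative stand-in, not miRCat2's actual entropy measure, which the abstract does not specify; the function name and the toy count vectors are hypothetical.

```python
import math

def kl_from_uniform(counts):
    """Kullback-Leibler divergence (in bits) of a count profile from uniform."""
    total = sum(counts)
    if total == 0:
        return 0.0
    n = len(counts)
    divergence = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            # Each term is p * log2(p / (1/n)).
            divergence += p * math.log2(p * n)
    return divergence

# Toy read counts per position in a window (hypothetical).
flat = [10, 10, 10, 10]     # reads spread evenly: no peak
peaked = [1, 1, 36, 2]      # reads concentrated at one position
single = [0, 0, 8, 0]       # all reads at one position

flat_score = kl_from_uniform(flat)
peaked_score = kl_from_uniform(peaked)
single_score = kl_from_uniform(single)
```

Because the counts are normalized to proportions before comparison, doubling every count leaves the score unchanged, which hints at why a distribution-based test can be less sensitive to sequencing depth than a raw abundance threshold.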


39

Bista, Iliana-Aglaia. "Defining a high throughput sequencing identification framework for freshwater ecosystem biomonitoring." Thesis, Bangor University, 2016. https://research.bangor.ac.uk/portal/en/theses/defining-a-high-throughput-sequencing-identification-framework-for-freshwater-ecosystem-biomonitoring(133e53f8-e300-495b-89e9-c1b3188d8acb).html.


Abstract:

Freshwater ecosystems are currently amongst the most threatened habitats due to high levels of anthropogenic stress, and increasing efforts are required to monitor their status and assess aquatic biodiversity. Biomonitoring, which is the systematic measurement of the responses of aquatic biota to environmental stressors, is used to evaluate ecosystem status. Macroinvertebrates are commonly used organisms for ecosystem assessment, due to their numerous biomonitoring qualities, which qualify them as ecological indicators. Traditional taxonomy-based monitoring is labour intensive, which limits the throughput, and is often inefficient in providing species level identification, which limits the accuracy of detections. The introduction of molecular based methods for biomonitoring, especially when coupled with High Throughput Sequencing (HTS) applications, offers a step change in ecosystem monitoring. Here I tested the utility of DNA based applications for increasing the efficiency of freshwater ecosystem biomonitoring, using benthic macroinvertebrates as a target group. For the first part of this work, I used DNA barcoding of the Cytochrome Oxidase Subunit I (COI), from individual specimens, to populate a barcode reference library for 94 species of Trichoptera, Gastropoda and Chironomidae from the UK. Then, I used HTS methods to characterise diversity from complex environmental samples. First, I used metabarcoding of aqueous environmental DNA (eDNA) and community invertebrate samples (Chironomidae pupal exuviae), collected at regular intervals throughout a year, to identify diversity levels and temporal patterns of community variation on ecosystem-wide and group specific scales.
Finally, I used a structured design of mock macroinvertebrate communities, of known biomass content, to perform a comparison between PCR-based metabarcoding of the COI gene and PCR-free shotgun sequencing of mitochondrial genomes (mito-metagenomics), and evaluate their efficiency for accurate characterisation of biomass content of bulk samples. Overall, HTS has demonstrated great potential for advancing biomonitoring efforts, allowing ecosystem scale diversity detection from non-invasive types of samples, such as eDNA, whilst moving into mito-metagenomic work could improve the field even further by improving quantitative abundance results on the community composition level.


40

GIANGREGORIO, TANIA. "High throughput sequencing analysis for the molecular diagnosis of Inherited Thrombocytopenias." Doctoral thesis, Università degli Studi di Trieste, 2019. http://hdl.handle.net/11368/2962379.


Abstract:

Inherited thrombocytopenias (ITs) are a heterogeneous group of rare genetic disorders characterized by reduced platelet count, sometimes combined with bleeding tendency and/or other clinical defects. The molecular diagnosis of ITs is essential to make clinical decisions and infer personalized prognosis and risks. More than 30 genes have been identified that harbor mutations responsible for ITs (Balduini et al., 2017). In addition, ITs often show phenotypic overlaps that hamper the correct diagnosis with the traditional diagnostic algorithm based on step-wise specialized investigations. However, the advent of next generation sequencing has changed the diagnostic approach to diseases characterized by high genetic heterogeneity like ITs. In order to improve the diagnosis of IT, we designed a targeted next generation sequencing panel (IT-NGS) to screen the 28 genes most commonly mutated in ITs. Ninety-seven consecutive probands with a suspicion of IT were sequenced. The analysis led us to reach a definite diagnosis for 37 probands. In these probands we identified known or novel likely pathogenic mutations causing specific diseases, including monoallelic Bernard Soulier syndrome (N=14), biallelic Bernard Soulier syndrome (N=4), ACTN1-related thrombocytopenia (N=4), MYH9-related disease (N=7), ANKRD26-related thrombocytopenia (N=4), congenital amegakaryocytic thrombocytopenia (N=1), grey platelet syndrome (N=1), Wiskott-Aldrich syndrome (N=1) and Acute Myelogenous Leukemia (N=1). In another 34 cases we identified variants of uncertain significance (VUS) whose pathogenic role has to be supported by segregation analysis and in-depth functional studies. Since 17 probands had no potential candidate variant impacting IT-NGS genes, they are eligible for whole exome sequencing (WES) to clone novel genes involved in ITs.
In conclusion, since some IT forms predispose to additional acquired disease during life, an accurate diagnosis is essential to infer personalized prognosis and define proper treatments and follow-up. Because of clinical and genetic heterogeneity, the molecular diagnosis of ITs represents a lengthy and expensive challenge using conventional technologies. The use of IT-NGS in clinical practice, aided by specific investigations clarifying the role of variants of uncertain significance, overcomes these issues, facilitating a definite diagnosis in patients with a suspicion of known IT forms.

41

Barquist, Lars. "High-throughput experimental and computational studies of bacterial evolution." Thesis, University of Cambridge, 2014. https://www.repository.cam.ac.uk/handle/1810/245138.

Abstract:

The work in this thesis is concerned with the study of bacterial adaptation on short and long timescales. In the first section, consisting of three chapters, I describe a recently developed high-throughput technology for probing gene function, transposon-insertion sequencing, and its application to the study of functional differences between two important human pathogens, Salmonella enterica subspecies enterica serovars Typhi and Typhimurium. In a first study, I use transposon-insertion sequencing to probe differences in gene requirements during growth on rich laboratory media, revealing differences in serovar requirements for genes involved in iron utilization and cell-surface structure biogenesis, as well as in requirements for non-coding RNA. In a second study I more directly probe the genomic features responsible for differences in serovar pathogenicity by analyzing transposon-insertion sequencing data produced following a two-hour infection of human macrophages, revealing large differences in the selective pressures felt by these two closely related serovars in the same environment. The second section, consisting of two chapters, uses statistical models of sequence variation, i.e. covariance models, to examine the evolution of intrinsic termination across the bacterial kingdom. A first collaborative study provides background and motivation in the form of a method for identifying Rho-independent terminators using covariance models built from deep alignments of experimentally verified terminators from Escherichia coli and Bacillus subtilis. In the course of developing this method I discovered a novel putative intrinsic terminator in Mycobacterium tuberculosis. In the final chapter, I extend this approach to de novo discovery of intrinsic termination motifs across the bacterial phylogeny.
I present evidence for lineage-specific variations in canonical Rho-independent terminator composition, as well as discover seven non-canonical putative termination motifs. Using a collection of publicly available RNA-seq datasets, I provide evidence for the function of some of these elements as bona fide transcriptional attenuators.

42

Ghazanfar, Shila. "Statistical approaches to harness high throughput sequencing data in diverse biological systems." Thesis, The University of Sydney, 2017. http://hdl.handle.net/2123/17268.

Abstract:

The development of novel statistical approaches to questions specific to biological systems of interest is becoming more valuable as we tackle increasingly complex problems. This thesis explores three distinct biological systems in which high throughput sequencing data is utilised, varying in research area, organism, number of sequencing platforms and datasets integrated, and structure such as matched samples, showcasing the variety of study designs and thus the need for tailored statistical approaches. First, we characterise allelic imbalance from RNA-Seq data, including stringent filtering criteria and a count-based likelihood ratio test. This work identified genes of particular importance in livestock genomics, such as those related to energy use. Second, we outline a novel methodology to identify highly expressed genes and cells for single cell RNA-Seq data. We derive a gamma-normal mixture model to identify lowly and highly expressed components, and use this to identify novel markers for olfactory sensory neuron (OSN) maturity across publicly available mouse neuron datasets. In addition, we estimate single cell networks and find that mature OSN single cell networks are more centralised than immature OSN single cell networks. Third, we develop two novel frameworks for relating information from Whole Exome DNA-Seq and RNA-Seq data when i) samples are matched and when ii) samples are not necessarily matched between platforms. In the latter case, we relate functional somatic mutation driver gene scores to transcriptional network correlation disturbance using a permutation testing framework, identifying potential candidate genes for targeted therapies. In the former case, we estimate directed mutation-expression networks for each cancer using linear models, providing a useful exploratory tool for identifying novel relationships among genes. This thesis demonstrates the importance of tailored statistical approaches to further understanding across many biological systems.

43

Esteve, Codina Anna. "Characterization of the Iberian pig genome and transcriptome using high throughput sequencing." Doctoral thesis, Universitat Autònoma de Barcelona, 2012. http://hdl.handle.net/10803/134673.

Abstract:

In this thesis, we studied the patterns of nucleotide variability in the pig genome to better understand which evolutionary forces have shaped it. The domestic pig is a domesticated species that shows great phenotypic variability as a result of domestication and the formation of modern breeds. Moreover, the wild boar and other closely related species are still extant today, which facilitates the search for candidate genes that have undergone artificial selection. The pig is also important in biomedicine, as a model for human diseases and as a reservoir of organs for humans. In the first chapter, we set out to detect whether artificial selection has acted on a possible candidate gene for meat quality in pigs, SERPINA6. To this end, we studied nucleotide variability in several domestic pigs and wild boars of different origins (Asian and European). The analysis, however, was not conclusive. Second, using new next-generation sequencing techniques, we were able to study nucleotide variability not just in a single gene but across the complete genome of an Iberian pig. This also allowed us to study and characterize its transcriptome. To do so, we used several complementary techniques and methodologies: whole genome sequencing, reduced representation libraries, pool sequencing and transcriptome sequencing. The estimated nucleotide variability was 0.7 per kb, a far from negligible value considering the high inbreeding coefficient of this pig strain. We also observed that telomeres have higher variability than centromeres, which can be explained by a higher recombination rate. In addition, the X chromosome shows much lower variability than expected relative to the autosomes, probably because of selection or other demographic effects.
To study genomic regions under selection, we divided the genome into non-overlapping windows and computed different selection tests, both in a pool of Iberian pigs and in a single individual. Regions with an excess of polymorphism, which could therefore be under balancing selection, are enriched in olfactory receptors and histocompatibility complex genes. By contrast, regions with excess differentiation and very low variability show no clear enrichment for any function. Nevertheless, we cite possible candidate genes related to lipid metabolism, keratinization, hair formation and behaviour. Massive sequencing techniques also make it possible to detect structural variants based on read-depth patterns. In this way, we identified copy number gains in certain regions of the Iberian genome relative to the reference genome. In total, we found that 36 Mb of the genome is affected and that approximately 5% of genes lie within these regions. We were thus able to identify new paralogs of annotated genes, most of them belonging to large gene families. Finally, we compared the male gonad transcriptome of two pigs with very extreme phenotypes, one Iberian and one Large White. The differentially expressed genes are related to spermatogenesis and lipid metabolism, in line with their phenotypic differences. We also identified new unannotated genes, long non-coding RNAs and transposable elements expressed in this tissue.

44

Okonechnikov, Konstantin [Verfasser]. "High-throughput RNA sequencing: a step forward in transcriptome analysis / Konstantin Okonechnikov." Berlin : Freie Universität Berlin, 2016. http://d-nb.info/1084634686/34.

45

Rubelt, Florian [Verfasser]. "Investigations into the human immunoglobulin repertoire utilizing high-throughput sequencing / Florian Rubelt." Berlin : Freie Universität Berlin, 2012. http://d-nb.info/1030488894/34.

46

Chao, Yuanqing, and 晁元卿. "Studies of biofilm development by advanced microscopic techniques and high-throughput sequencing." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2013. http://hub.hku.hk/bib/B50899922.

Abstract:

This study investigated biofilm formation using advanced microscopic and high-throughput sequencing techniques. The major tasks were (1) to quantitatively evaluate the initial bacterial attachment processes by Atomic Force Microscopy (AFM); (2) to characterize the chemical variation during biofilm formation by Raman microscopy; and (3) to analyze the microbial structure and functions in wastewater and drinking water biofilms by metagenomic analysis. To determine the lateral detachment force for bacteria, a quantitative method using the contact mode of AFM was developed. The established method had good repeatability and sensitivity to various bacteria and substrata, and was applied to evaluate the roles of bacterial surface polymers, i.e. lipopolysaccharides, type 1 fimbriae and capsular colanic acid, in Phase I and II attachment. The results indicated that lipopolysaccharides largely enhanced Phase I and II attachment. Fimbriae increased Phase I attachment but did not significantly influence the adhesion strength in Phase II. Moreover, colanic acid had a negative effect on attachment in both Phases I and II. Surface-enhanced Raman scattering was applied to evaluate the chemical components in the biofilm matrix at different growth phases, including initial attached bacteria, colonies and mature biofilm. Three model bacteria, Escherichia coli, Pseudomonas putida, and Bacillus subtilis, were used to cultivate biofilms. The results showed that the content of carbohydrates, proteins, and nucleic acids in the biofilm matrix increased significantly with biofilm growth for all three bacteria, judging from the intensities and appearance probabilities of related marker peaks in the spectra. The content of lipids, however, increased only in the Gram-negative biofilms.
Moreover, metagenomic data, coupled with PCR-based 454 pyrosequencing reads, were generated for activated sludge and biofilm from a full-scale hybrid reactor to study the microbial taxonomic and functional differences and connections between activated sludge and biofilm. The results showed that the dominant bacteria co-existed in both samples. Global functions in the activated sludge and biofilm metagenomes showed quite similar patterns, revealing limited differences in overall function between the two samples. For nitrogen removal, the diversity and abundance of nitrifiers and denitrifiers in biofilm did not surpass those in activated sludge. However, higher abundances of nitrification and denitrification genes were indeed found in biofilm, suggesting that the increased nitrogen removal achieved by applying biofilm might be attributed to removal efficiency rather than biomass accumulation of nitrogen removal bacteria. To investigate the bacterial structure and functions of drinking water biofilm, PCR-based 454 pyrosequencing of the 16S rRNA gene and Illumina metagenomic data were generated and analyzed. Significant differences in bacterial diversity and taxonomic structure were found between biofilms formed on stainless steel and plastics. Moreover, ecological succession could be clearly observed during biofilm formation. A metabolic network analysis for drinking water biofilm was constructed for the first time. In addition, the occurrence and abundance of specific genes involved in the bacterial pathway of glutathione metabolism and in the production and degradation of extracellular polymeric substances were also evaluated.
Civil Engineering
Doctoral
Doctor of Philosophy

47

Chen, Nanhua. "Application of high-throughput sequencing for the analyses of PRRSV-host interactions." Diss., Kansas State University, 2014. http://hdl.handle.net/2097/18664.

Abstract:

Doctor of Philosophy
Department of Diagnostic Medicine and Pathobiology
Raymond R. R. Rowland
Porcine Reproductive and Respiratory Syndrome Virus (PRRSV) is the most costly virus to the swine industry worldwide. This study explored the application of deep sequencing techniques to better understand the virus-host interaction. On the virus side, PRRSV exists as a quasispecies. The first application of deep sequencing was to investigate amino acid substitutions in hypervariable regions during acute infection and after virus rebound. The appearance and disappearance of mutations, especially the generation of a new N-glycosylation site in GP5, indicated that they are likely the result of immune selection. The second application of deep sequencing was to investigate the quasispecies makeup in pigs with severe combined immunodeficiency (SCID) that lack B and T cells. The results showed the same pattern of amino acid substitutions in SCID and normal littermates, and no mutations were identified that differed between SCID and normal littermates. This suggests that the mutations that appear during the early stages of infection are the product of the virus becoming adapted to replication in pigs. The third application of deep sequencing was to investigate the locations of recombination events between GFP-expressing PRRSV infectious clones. The results identified distinct cross-over events within three conserved regions between the EGFP and GFPm genes. Finally, the fourth goal was to develop a set of sequencing tools for analyzing the host antibody repertoire. A simple method was developed to amplify swine VDJ repertoires. Shared and abundant VDJ sequences that are likely expressed by PRRSV-activated B cells were determined in pigs that had different neutralization activities. These sequences are potentially correlated with different antibody responses.

48

Bellos, Evangelos. "Statistical methods for elucidating copy number variation in high-throughput sequencing studies." Thesis, Imperial College London, 2014. http://hdl.handle.net/10044/1/24867.

Abstract:

Copy number variation (CNV) is pervasive in the human genome and has been shown to contribute significantly to phenotypic diversity and disease aetiology. High-throughput sequencing (HTS) technologies have allowed for the systematic investigation of CNV at an unprecedented resolution. HTS studies offer multiple distinct features that can provide evidence for the presence of CNV. We have developed an integrative statistical framework that jointly analyses multiple sequencing features at the population level to achieve sensitive and precise discovery of CNV. First, we applied our framework to low-coverage whole-genome sequencing experiments and used data from the 1000 Genomes Project to demonstrate a substantial improvement in CNV detection accuracy over existing methods. Next, we extended our approach to targeted HTS experiments, which offer improved cost-efficiency by focusing on a predetermined subset of the genome. Targeted HTS involves an enrichment step that introduces non-uniformity in sequencing coverage across target regions and thus hinders CNV identification. To that end, we designed a customized normalization procedure that counteracts the effects of enrichment bias and enhances the underlying CNV signal. Our extended framework was benchmarked on contiguous capture datasets, where it was shown to outperform competing strategies by a wide margin. Capture sequencing can also generate large amounts of data in untargeted genomic regions. Although these off-target results can be a valuable source of CNV evidence, they are subject to complex enrichment patterns that confound their interpretation. Therefore, we developed the first normalization strategy that can adapt to the highly heterogeneous nature of off-target capture and thus facilitate CNV investigation in untargeted regions. 
All in all, we present a generalized CNV detection toolset that has been shown to achieve robust performance across datasets and sequencing platforms and can therefore provide valuable insight into the prevalence and impact of CNV.

49

Oral, Münevver. "Insights into isogenic clonal fish line development using high-throughput sequencing technologies." Thesis, University of Stirling, 2016. http://hdl.handle.net/1893/24909.

Abstract:

Isogenic clonal fish lines are a powerful resource for aquaculture-related research. Fully inbred individuals, the clone founders, can be produced through either mitotic gynogenesis or androgenesis, and a further generation from those propagates fully inbred clonal lines. Despite rapid generation, as opposed to the successive generations of sibling mating required in mice, the production of such lines may be hampered by (i) potential residual contribution from irradiated gametes associated with poorly optimised protocols, (ii) reduced survival of clone founders and (iii) the spontaneous occurrence of meiotic gynogenetics with varying degrees of heterozygosity, contaminating fully homozygous progenies. This research set out to address these challenges and gain insights into isogenic clonal fish line development by using double-digest RADseq (ddRADseq) to generate large numbers of genetic markers covering the genome of interest. Analysis of potential contribution from irradiated sperm indicated successful uniparental inheritance in meiotic and mitotic gynogenetic European seabass. Exclusive transmission of maternal alleles was detected in G1 progeny of Atlantic salmon (with a duplicated genome), while G2 progenies presented varying levels of sire contribution, suggesting sub-optimal UV irradiation that had previously gone undetected with 27 microsatellite markers. Identification of telomeric markers in European seabass with higher recombination frequencies, for efficient differentiation of meiotic and mitotic gynogenetics, was successful, and a genetic linkage map was generated from these data. One clear case of a spontaneous meiotic gynogenetic fish was detected among 18 putative DH fish in European seabass, despite earlier screening for isogenicity using 11 microsatellite markers. An unidentified mechanism inhibiting restriction digestion of larval DNA, observed in Nile tilapia, prevented the construction of a SNP-based genetic linkage map.
In summary, this study provides strong evidence for the efficacy of NGS technologies for the development and verification of isogenic clonal fish lines. Reliable establishment of isogenic clonal fish lines is critical for their utility as a research tool.

50

Beckers, Matthew. "Quality checking and expression analysis of high-throughput small RNA sequencing data." Thesis, University of East Anglia, 2015. https://ueaeprints.uea.ac.uk/58581/.

Abstract:

The advent of high-throughput RNA sequencing (RNA-seq) methods has made it possible to sequence transcriptomes for the cell-wide identification of small non-coding RNAs (sRNAs) and to assess their regulation using differential expression analysis by comparing two or more different conditions. During an analysis of a typical set of sRNA sequencing (sRNA-seq) libraries, a large variety of tools and methods are used on the dataset in order to understand the data's quality and content, and to summarise the knowledge gained from the entire analysis. Many of the tools available to do this were created for mRNA sequencing (mRNA-seq) datasets. In this thesis, we present and implement a processing pipeline that can be used to assess the quality and the differential expression of sRNA-seq datasets over two or more different conditions. We then utilise aspects of this pipeline in various sRNA-seq experiments. Firstly, we combine our pipeline with current tools for miRNA identification to assess the regulation of miRNAs during larval caste differentiation in a novel genome: the European bumblebee (Bombus terrestris). Secondly, we explore the differential expression during cell stress of all classes of sRNAs using two cell lines in humans. We also find that a specific protein, Ro60, is required for the expression of mRNA-derived sRNAs during stress, similar to the way in which sRNAs derived from Y RNAs are regulated. Finally, we utilise our understanding of sRNA mapping patterns, alongside current tools for miRNA identification, to search for functional miRNAs and other sRNAs in the novel genomes of two diatoms. The lack of canonical miRNA predictions in this study has repercussions for the evolutionary theory behind miRNAs.
The implementation of our pipeline for sRNA-seq data provides an interactive and quality controlled workflow that can be used to process a dataset from raw sequences to the results of several differential expression experiments for all identified sRNA classes within a sequenced transcriptome.
