Dissertations / Theses on the topic 'High Throughput Phenotypic Data'

Consult the top 50 dissertations / theses for your research on the topic 'High Throughput Phenotypic Data.'

1

Yu, Haipeng. "Designing and modeling high-throughput phenotyping data in quantitative genetics." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/97579.

Abstract:
Quantitative genetics aims to bridge the genome-to-phenome gap. The advent of high-throughput genotyping technologies has accelerated the progress of genome-to-phenome mapping, but a challenge remains in phenotyping. Various high-throughput phenotyping (HTP) platforms have been developed recently to obtain economically important phenotypes in an automated fashion with less human labor and reduced costs. However, effective ways of designing HTP experiments have not been investigated thoroughly. In addition, high-dimensional HTP data pose a major challenge for statistical analysis by increasing computational demands. A new strategy for modeling high-dimensional HTP data and elucidating the interrelationships among these phenotypes is needed. Previous studies used pedigree-based connectedness statistics to study the design of phenotyping. The availability of genetic markers provides a new opportunity to evaluate connectedness based on genomic data, which can serve as a means to design HTP. This dissertation first examines the utility of connectedness across three studies. In the first study, I introduced genomic connectedness and compared it with traditional pedigree-based connectedness. The second study investigated the relationship between genomic connectedness and prediction accuracy based on cross-validation. The third study introduced a user-friendly connectedness R package, which provides a suite of functions to evaluate the extent of connectedness. In the last study, I proposed a new statistical approach to model high-dimensional HTP data by combining confirmatory factor analysis and Bayesian networks. Collectively, the results of the first three studies suggest the potential usefulness of applying genomic connectedness to design HTP. The statistical approach introduced in the last study provides a new avenue for modeling high-dimensional HTP data holistically, helping us understand the interrelationships among phenotypes derived from HTP.
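The abstract does not name the specific connectedness statistics used, so as a hedged illustration only: a common choice in this literature is the prediction error variance of differences (PEVD) between the estimated effects of two management units, with genomic connectedness obtained by replacing the pedigree relationship matrix with a marker-based genomic relationship matrix:

```latex
% Illustrative; the dissertation may use additional or different measures.
\mathrm{PEVD}(i,j) \;=\; \mathrm{PEV}(\hat{u}_i) + \mathrm{PEV}(\hat{u}_j)
 - 2\,\mathrm{PEC}(\hat{u}_i,\hat{u}_j),
\qquad
\mathbf{G} \;=\; \frac{\mathbf{W}\mathbf{W}^{\top}}{2\sum_{k} p_k\,(1-p_k)}
```

Here PEV and PEC are prediction error variances and covariances obtained from the inverse of the mixed-model coefficient matrix, W is the centered marker matrix, and p_k is the allele frequency of marker k; smaller PEVD between two units indicates stronger connectedness.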
Doctor of Philosophy
Quantitative genetics aims to bridge the genome-to-phenome gap. With the advent of genotyping technologies, the genomic information of individuals can be included in a quantitative genetic model. A remaining challenge is to obtain sufficient, accurate phenotypes in an automated fashion with less human labor and reduced costs. High-throughput phenotyping (HTP) technologies have emerged recently, opening a new opportunity to address this challenge. However, there is a paucity of research on phenotyping design and on modeling high-dimensional HTP data. The main themes of this dissertation are 1) genomic connectedness, which could potentially be used as a means to design a phenotyping experiment, and 2) a novel statistical approach for handling high-dimensional HTP data. In the first three studies, I compared genomic connectedness with pedigree-based connectedness, investigated the relationship between genomic connectedness and prediction accuracy derived from cross-validation, and developed a connectedness R package that implements a variety of connectedness measures. The fourth study developed a novel statistical approach combining dimension reduction and graphical models to understand the interrelationships among high-dimensional HTP data.
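In a hedged, simplified form (the notation is mine, and the dissertation's parameterization may differ), the combination of confirmatory factor analysis and a Bayesian network in the fourth study can be pictured as a two-layer model: observed HTP traits load onto a few latent factors, and a directed acyclic graph is then learned among those factors:

```latex
% Measurement layer (confirmatory factor analysis):
\mathbf{y} = \boldsymbol{\Lambda}\,\mathbf{f} + \boldsymbol{\epsilon},
\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\,\boldsymbol{\Psi})

% Structural layer (Bayesian network over the latent factors):
f_j = \sum_{k \in \mathrm{pa}(j)} \beta_{jk}\, f_k + \delta_j
```

The first layer compresses many correlated HTP-derived phenotypes into interpretable latent factors; the second captures directed interrelationships among them.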
2

Manrique, Tito. "Functional linear regression models : application to high-throughput plant phenotyping functional data." Thesis, Montpellier, 2016. http://www.theses.fr/2016MONTT264/document.

Abstract:
Functional data analysis (FDA) is a branch of statistics that is increasingly used in many applied scientific fields such as biological experimentation, finance, and physics. One reason for this is the use of new data-collection technologies that increase the number of observations within a time interval. Functional datasets are samples of realizations of random functions, which are measurable functions defined on a probability space with values in an infinite-dimensional functional space. Among the many questions that FDA studies, functional linear regression is one of the most studied, both in applications and in methodological development. The objective of this thesis is the study of functional linear regression models when both the covariate X and the response Y are random functions and both are time-dependent. In particular, we address the question of how the history of a random function X influences the current value of another random function Y at a given time t. To do this, we are mainly interested in three models: the functional concurrent model (FCCM), the functional convolution model (FCVM) and the historical functional linear model. For the FCVM and FCCM in particular, we propose estimators that are consistent, robust and faster to compute than others already proposed in the literature. Our estimation method for the FCCM extends the ridge regression method developed in the classical linear case to the functional data framework. We prove the convergence in probability of this estimator, obtain a rate of convergence and develop an optimal selection procedure for the regularization parameter. The FCVM makes it possible to study the influence of the history of X on Y in a simple way through convolution. In this case, we use the continuous Fourier transform operator to define an estimator of the functional coefficient. This operator transforms the convolution model into an associated FCCM in the frequency domain. The consistency and rate of convergence of the estimator are derived from the FCCM. The FCVM can be generalized to the historical functional linear model, which is itself a particular case of the fully functional linear model. Thanks to this, we use the Karhunen–Loève estimator of the historical kernel. The related question of estimating the covariance operator of the noise in the fully functional linear model is also treated. Finally, we use all the aforementioned models to study the interaction between Vapour Pressure Deficit (VPD) and Leaf Elongation Rate (LER) curves. This kind of data is obtained with a high-throughput plant phenotyping platform and is well suited to FDA methods.
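As a hedged schematic (notation mine), the three models relate as follows: the concurrent model acts pointwise in time, the convolution model integrates the past of X, and the continuous Fourier transform maps the latter onto the former:

```latex
\text{FCCM:}\quad Y(t) = \beta(t)\,X(t) + \epsilon(t)
\qquad
\text{FCVM:}\quad Y(t) = \int_0^{t} \theta(s)\,X(t-s)\,\mathrm{d}s + \epsilon(t)

\text{Frequency domain:}\quad
\mathcal{F}(Y)(\xi) = \mathcal{F}(\theta)(\xi)\,\mathcal{F}(X)(\xi) + \mathcal{F}(\epsilon)(\xi)
```

By the convolution theorem, the FCVM becomes a concurrent model with coefficient F(θ)(ξ) in the frequency domain, so the functional ridge estimator for the FCCM can be applied pointwise in ξ and the result inverted back to recover θ.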
3

Paszkowski-Rogacz, Maciej. "Integration and analysis of phenotypic data from functional screens." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-63063.

Abstract:
Motivation: Although various high-throughput technologies provide a wealth of valuable information, each gives insight into different aspects of cellular activity and each has its own limitations. Thus, a complete and systematic understanding of the cellular machinery can be achieved only by a combined analysis of results coming from different approaches. However, methods and tools for the integration and analysis of heterogeneous biological data still have to be developed. Results: This work presents a systemic analysis of basic cellular processes, i.e., cell viability and the cell cycle, as well as embryonic stem cell pluripotency and differentiation. These phenomena were studied using several high-throughput technologies, whose combined results were analysed with existing and novel clustering and hit-selection algorithms. This thesis also introduces two novel data management and data analysis tools. The first, called DSViewer, is a database application designed for integrating and querying results coming from various genome-wide experiments. The second, named PhenoFam, is an application performing gene set enrichment analysis by employing structural and functional information on families of protein domains as annotation terms. Both programs are accessible through a web interface. Conclusions: The investigations presented in this work provide the research community with a novel and markedly improved repertoire of computational tools and methods that facilitate the systematic translation of information accumulated in high-throughput studies into novel biological insights.
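For intuition about the enrichment step: gene set enrichment against an annotation term (here, a protein domain family) is often reduced to a hypergeometric over-representation test. The sketch below uses invented numbers and is a generic illustration, not necessarily the statistic PhenoFam implements.

```python
from scipy.stats import hypergeom

def enrichment_p(hits_in_set, set_size, hits_total, universe):
    """P(X >= hits_in_set) when the annotation term covers `set_size`
    genes out of a `universe` that contains `hits_total` screen hits."""
    return hypergeom.sf(hits_in_set - 1, universe, hits_total, set_size)

# Hypothetical screen: 500 hits out of 20,000 genes; a 40-gene domain
# family contains 8 of them.
print(enrichment_p(8, 40, 500, 20000))  # small p => family is enriched
```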
4

Xue, Zeyun. "Integration of high-throughput phenotyping and genomics data to explore Arabidopsis natural variation." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASB001.

Abstract:
Nitrogen and water are crucial for plant survival as well as for crop yield; however, the molecular mechanisms that plants mobilise in response to nitrogen (N) or water (W) deficiency, and to their combination, remain partly unknown. The interconnections between water status and N availability have drawn much attention, and given their critical importance, dissecting the role of each stress within the combined stress is essential. We address here the question of how mild drought and nitrogen stress responses are integrated and how they impair rosette growth and plant metabolism. In this thesis, a systematic investigation was performed to understand how N deficiency and drought combine to shape dynamic rosette growth in Arabidopsis. We integrated transcriptome and metabolome data to draw a holistic view of drought x N-deficiency interactions. Moreover, as a case study, 5 highly divergent accessions were used to investigate how genetic components regulate stress responses, in other words, GxWxN interactions. Evaluation of the drought, N-deficiency and combined-stress transcriptomes and metabolomes revealed shared and stress-specific response signatures that were conserved primarily across genotypes, although many more genotype-specific responses were also uncovered. The accession-specific transcriptome adjustments and metabolic profiles reflected distinct basal physiological states, such as those of Col-0 and Tsu-0. We also found a subset of stress-responsive genes responsible for fine-tuning the combined stress response, such as ROXYs, TAR4, NRT2.5 and GLN1;4. In addition, we integrated transcriptomic and metabolomic data to construct a multi-omics regulatory network. Two drought-responsive metabolites, raffinose and myo-inositol, were highlighted by an integrative analysis showing N-deficiency response patterns shared across the 5 accessions. This study provides molecular resolution of the genetic variation in combined stress responses involving interactions between N deficiency and drought stress and illustrates the corresponding transcriptome and metabolome plasticity. Moreover, a large-scale GWA analysis using a worldwide population was conducted to decipher the genetic architecture at the metabolic level, providing links between metabolomic plasticity and the phenotypic diversity behind local adaptation, and extending our vision of this diversity to the species scale. The comparison of GWA analyses based on a regional-scale population and a species-wide population also sheds light on how population structure can limit the detection power of GWA analysis.
5

Mack, Jennifer [Verfasser]. "Constraint-based automated reconstruction of grape bunches from 3D range data for high-throughput phenotyping / Jennifer Mack." Bonn : Universitäts- und Landesbibliothek Bonn, 2019. http://d-nb.info/1200020081/34.

6

Mervin, Lewis. "Improved in silico methods for target deconvolution in phenotypic screens." Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/283004.

Abstract:
Target-based screening projects for bioactive (orphan) compounds have in many cases been shown to be insufficiently predictive of in vivo efficacy, leading to attrition in clinical trials. Partly for this reason, phenotypic screening has undergone a renaissance in both academia and the pharmaceutical industry. One key shortcoming of this paradigm shift is that the protein targets modulated must be elucidated afterwards, which is often a costly and time-consuming procedure. In this work, we have explored both improved methods and real-world case studies of how computational methods can help in the target elucidation of phenotypic screens. One limitation of previous methods has been the difficulty of assessing the applicability domain of the models, that is, when the assumptions made by a model are fulfilled and for which input chemicals the models are reliably appropriate. Hence, a major focus of this work was to explore methods for calibrating machine learning algorithms using Platt scaling, isotonic regression scaling and Venn-Abers predictors, since the probabilities from well-calibrated classifiers can be interpreted at a given confidence level and predictions specified at an acceptable error rate. Additionally, many current protocols only offer probabilities of affinity, so another key area for development was to expand the target prediction models with functional prediction (activation or inhibition). This extra level of annotation is important since the activation or inhibition of a target may positively or negatively impact the phenotypic response in a biological system. Furthermore, many existing methods do not utilize the wealth of bioactivity information held for orthologue species. We therefore also conducted an in-depth analysis of orthologue bioactivity data and its relevance and applicability to expanding compound and target bioactivity space for predictive studies. The realized protocol was trained on 13,918,879 compound-target pairs, comprises 1,651 targets, and has been made available for public use on GitHub. The methodology was then applied to aid the target deconvolution of AstraZeneca phenotypic readouts, in particular the rationalization of cytotoxicity and cytostaticity in the High-Throughput Screening (HTS) collection. Results from this work highlighted which targets are frequently linked to the cytotoxicity and cytostaticity of chemical structures, and provided insight into which compounds to select or remove from the collection for future screening projects. Overall, this project has furthered the field of in silico target deconvolution by improving the performance and applicability of current protocols and by rationalizing cytotoxicity, which has been shown to influence attrition in clinical trials.
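To make the calibration theme concrete, here is a minimal sketch of Platt scaling with scikit-learn on simulated, imbalanced data standing in for bioactivity labels (all numbers are invented; swapping method="sigmoid" for method="isotonic" gives isotonic regression scaling instead):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for compound descriptors with a rare active class.
X, y = make_classification(n_samples=4000, n_features=50, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Platt scaling: a sigmoid is fitted to out-of-fold scores of the base model.
platt = CalibratedClassifierCV(RandomForestClassifier(n_estimators=200,
                                                      random_state=0),
                               method="sigmoid", cv=3).fit(X_tr, y_tr)

for name, model in [("raw", raw), ("platt", platt)]:
    frac_pos, mean_pred = calibration_curve(
        y_te, model.predict_proba(X_te)[:, 1], n_bins=10)
    # Mean gap between predicted probability and observed frequency per bin:
    print(name, round(float(np.abs(frac_pos - mean_pred).mean()), 3))
```

Well-calibrated probabilities are what allow predictions to be reported at a chosen confidence level, as the abstract emphasizes.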
7

Roguski, Łukasz. "High-throughput sequencing data compression." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/565775.

Abstract:
Thanks to advances in sequencing technologies, biomedical research has experienced a revolution over recent years, resulting in an explosion in the amount of genomic data being generated worldwide. The typical space requirement for storing sequencing data produced by a medium-scale experiment lies in the range of tens to hundreds of gigabytes, with multiple files in different formats being produced by each experiment. The current de facto standard file formats used to represent genomic data are text-based. For practical reasons, these are stored in compressed form. In most cases, such storage methods rely on general-purpose text compressors, such as gzip. Unfortunately, however, these methods are unable to exploit the information models specific to sequencing data, and as a result they usually provide limited functionality and insufficient savings in storage space. This explains why relatively basic operations such as processing, storage, and transfer of genomic data have become a typical bottleneck of current analysis setups. Therefore, this thesis focuses on methods to efficiently store and compress the data generated from sequencing experiments. First, we propose a novel general-purpose FASTQ file compressor. Compared to gzip, it achieves a significant reduction in the size of the resulting archive, while also offering high data processing speed. Next, we present compression methods that exploit the high sequence redundancy present in sequencing data. These methods achieve the best compression ratio among current state-of-the-art FASTQ compressors, without using any external reference sequence. We also demonstrate different lossy compression approaches to store auxiliary sequencing data, which allow for further reductions in size. Finally, we propose a flexible framework and data format, which allows one to semi-automatically generate compression solutions that are not tied to any specific genomic file format. To facilitate the data management needed by complex pipelines, multiple genomic datasets with heterogeneous formats can be stored together in configurable containers, with an option to perform custom queries over the stored data. Moreover, we show that simple solutions based on our framework can achieve results comparable to those of state-of-the-art format-specific compressors. Overall, the solutions developed and described in this thesis can easily be incorporated into current pipelines for the analysis of genomic data. Taken together, they provide grounds for the development of integrated approaches towards efficient storage and management of such data.
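A toy illustration of why format-aware compression beats a general-purpose compressor: FASTQ interleaves three very different kinds of lines (headers, bases, qualities), and splitting them into separate streams lets each stream's regularities be exploited. Real FASTQ compressors apply specialized models per stream; the sketch below only demonstrates the stream-separation idea with gzip on synthetic records.

```python
import gzip

# Hypothetical in-memory FASTQ records (real data would come from a file).
records = [("@read%d pos=%d" % (i, 17 * i),
            "ACGTACGTTTGACCA"[(i * 3) % 8:][:12],
            "IIIIHHGGFFEEDDC"[(i * 5) % 4:][:12]) for i in range(5000)]
fastq = "".join("%s\n%s\n+\n%s\n" % r for r in records).encode()

# Split into one stream per field before compressing.
streams = ["".join(r[i] + "\n" for r in records).encode() for i in range(3)]

print("interleaved:", len(gzip.compress(fastq)), "bytes")
print("three streams:", sum(len(gzip.compress(s)) for s in streams), "bytes")
```

On realistic data the gains come less from the split itself than from applying a dedicated model to each stream (e.g. 2-bit packing for bases, context models for quality scores).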
8

Prinz zu Salm-Horstmar, Maximilian Philipp Albrecht. "The Chromosome 8p23.1 Inversion: High-Throughput Detection & Investigation of Phenotypic Impact." Thesis, Imperial College London, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.516479.

9

Jin, Shuangshuang. "Integrated data modeling in high-throughput proteomics." Online access for everyone, 2007. http://www.dissertations.wsu.edu/Dissertations/Fall2007/S_Jin_111907.pdf.

10

Capparuccini, Maria. "Inferential Methods for High-Throughput Methylation Data." VCU Scholars Compass, 2010. http://scholarscompass.vcu.edu/etd/156.

Abstract:
The role of abnormal DNA methylation in the progression of disease is a growing area of research that relies upon the establishment of sound statistical methods. The common method for declaring differential methylation between two groups at a given CpG site, as summarized by the difference between proportions methylated, Δβ = β₁ − β₂, has been a Filtered Two Sample t-test using the recommended filter of 0.17 (Bibikova et al., 2006b). In this dissertation, we performed a re-analysis of the data used in recommending that threshold by fitting a mixed-effects ANOVA model. We determined that the 0.17 filter is not accurate and conjectured that application of a Filtered Two Sample t-test likely leads to a loss of power. Further, the Two Sample t-test assumes that data arise from an underlying distribution spanning the entire real number line, whereas β₁ and β₂ are constrained to the interval [0, 1]. Additionally, imposing a filter at the minimum level of detectable difference likely reduces the power of a Two Sample t-test for smaller but truly differentially methylated CpG sites. Therefore, we compared the Two Sample t-test and the Filtered Two Sample t-test, which are widely used but largely untested with respect to their performance, to three proposed methods: a Beta distribution test, a Likelihood ratio test, and a Bootstrap test, each designed to address distributional concerns present in the current testing methods. Simulations comparing Type I and Type II error rates ultimately showed that the (unfiltered) Two Sample t-test and the Beta distribution test performed comparatively well.
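A minimal sketch of the Filtered Two Sample t-test described above (thresholds and data are invented), illustrating the power loss conjectured in the dissertation for modest but real differences:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def filtered_t_test(beta1, beta2, filter_threshold=0.17, alpha=0.05):
    """Declare a CpG site differentially methylated only if the mean
    difference clears the filter AND the t-test rejects."""
    d = beta1.mean() - beta2.mean()
    p = ttest_ind(beta1, beta2).pvalue
    return abs(d) >= filter_threshold and p < alpha

# A site with a true but modest group difference (0.10 < 0.17 filter):
b1 = np.clip(rng.normal(0.45, 0.05, 20), 0, 1)
b2 = np.clip(rng.normal(0.35, 0.05, 20), 0, 1)
print(filtered_t_test(b1, b2))          # False: the filter blocks the call
print(ttest_ind(b1, b2).pvalue < 0.05)  # True: the unfiltered test detects it
```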
11

Durif, Ghislain. "Multivariate analysis of high-throughput sequencing data." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1334/document.

Abstract:
The statistical analysis of next-generation sequencing data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension-reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Developments are made concerning the sparse Partial Least Squares (PLS) regression framework for supervised classification, and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose is the reconstruction and visualization of the data. First, we present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. Such a method can be used, for instance, for prediction (the fate of patients, or the specific type of unidentified single cells) based on gene expression profiles. The main issue in this framework is to account for the response when discarding irrelevant variables. We highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The interest of our procedure for data reconstruction, visualization and clustering is illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods were implemented in two R packages, "plsgenomics" and "CMF", based on high-performance computing.
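A hedged sketch of what a zero-inflated, over-dispersed count factorization can look like (notation mine; the thesis's exact parameterization may differ):

```latex
% Dropout layer (zero-inflation) on top of a Poisson factorization:
X_{ij} \mid D_{ij} \sim D_{ij}\,\mathrm{Poisson}\!\big((\mathbf{U}\mathbf{V}^{\top})_{ij}\big),
\qquad D_{ij} \sim \mathrm{Bernoulli}(1-\pi_j)

% Gamma priors induce over-dispersion; a spike-and-slab prior on the
% loadings performs probabilistic variable (gene) selection:
U_{ik} \sim \Gamma(a_1, b_1),
\qquad
V_{jk} \sim s_{jk}\,\Gamma(a_2, b_2) + (1 - s_{jk})\,\delta_0,
\quad s_{jk} \sim \mathrm{Bernoulli}(q_k)
```

Marginally, the Gamma-Poisson mixture yields over-dispersed counts, the Bernoulli dropout layer produces the excess zeros, and genes with s_{jk} = 0 for all k drop out of the low-dimensional representation.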
12

Zhang, Xuekui. "Mixture models for analysing high throughput sequencing data." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/35982.

Abstract:
The goal of my thesis is to develop methods and software for analysing high-throughput sequencing data, emphasizing sonicated ChIP-seq. For this goal, we developed a few variants of mixture models for genome-wide profiling of transcription factor binding sites and nucleosome positions. Our methods have been implemented into Bioconductor packages, which are freely available to other researchers. For profiling transcription factor binding sites, we developed a method, PICS, and implemented it into a Bioconductor package. We used a simulation study to confirm that PICS compares favourably to rival methods, such as MACS, QuEST, CisGenome, and USeq. Using published GABP and FOXA1 data from human cell lines, we then show that PICS predicted binding sites were more consistent with computationally predicted binding motifs than the alternative methods. For motif discovery using transcription binding sites, we combined PICS with two other existing packages to create the first complete set of Bioconductor tools for peak-calling and binding motif analysis of ChIP-Seq and ChIP-chip data. We demonstrate the effectiveness of our pipeline on published human ChIP-Seq datasets for FOXA1, ER, CTCF and STAT1, detecting co-occurring motifs that were consistent with the literature but not detected by other methods. For nucleosome positioning, we modified PICS into a method called PING. PING can handle MNase-Seq and MNase- or sonicated-ChIP-Seq data. It compares favourably to NPS and TemplateFilter in scalability, accuracy and robustness to low read density. To demonstrate that PING predictions from sonicated data can have sufficient spatial resolution to be biologically meaningful, we use H3K4me1 data to detect nucleosome shifts, discriminate functional and non-functional transcription factor binding sites, and confirm that Foxa2 associates with the accessible major groove of nucleosomal DNA. All of the above uses single-end sequencing data. At the end of the thesis, we briefly discuss the issue of processing paired-end data, which we are currently investigating.
13

Hoffmann, Steve. "Genome Informatics for High-Throughput Sequencing Data Analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-152643.

Abstract:
This thesis introduces three different algorithmic and statistical strategies for the analysis of high-throughput sequencing data. First, we introduce a heuristic method based on enhanced suffix arrays to map short sequences to larger reference genomes. The algorithm builds on the idea of an error-tolerant traversal of the suffix array for the reference genome, in conjunction with the concept of matching statistics introduced by Chang and a bitvector-based alignment algorithm proposed by Myers. The algorithm supports paired-end and mate-pair alignments, and the implementation offers methods for primer detection and primer and poly-A trimming. In our own benchmarks as well as independent benchmarks, this tool outcompetes other currently available tools with respect to sensitivity and specificity on simulated and real data sets for a large number of sequencing protocols. Second, we introduce a novel dynamic programming algorithm for the spliced alignment problem. The advantage of this algorithm is its capability to detect not only collinear splice events, i.e. local splice events on the same genomic strand, but also circular and other non-collinear splice events. This succinct and simple algorithm handles all these cases at the same time with high accuracy. While it is on par with other state-of-the-art methods for collinear splice events, it outcompetes other tools for many non-collinear splice events. The application of this method to publicly available sequencing data led to the identification of a novel isoform of the tumor suppressor gene p53. Since this gene is one of the best-studied genes in the human genome, this finding is quite remarkable and suggests that the application of our algorithm could help to identify a plethora of novel isoforms and genes. Third, we present a data-adaptive method to call single nucleotide variations (SNVs) from aligned high-throughput sequencing reads. We demonstrate that our method, based on empirical log-likelihoods, automatically adjusts to the quality of a sequencing experiment and thus renders a "decision" on when to call an SNV. In our simulations this method is on par with current state-of-the-art tools. Finally, we present biological results that have been obtained using the special features of the presented alignment algorithms.
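The bitvector alignment algorithm cited above is Myers' (1999) bit-parallel dynamic programming scheme. Below is a minimal single-pattern sketch of its semi-global (search) variant, not the thesis's implementation; Python's unbounded integers stand in for machine words.

```python
def myers_semi_global(pattern, text):
    """Edit distance of `pattern` against the best-matching substring of
    `text` (free start and end in the text), via Myers' bit-parallel DP."""
    m = len(pattern)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    peq = {}                                  # per-symbol match bitvectors
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m                # +1/-1 vertical delta vectors
    best = score
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)         # horizontal +1 positions
        mh = pv & xh                          # horizontal -1 positions
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask                 # no '| 1': text start is free
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        best = min(best, score)
    return best

print(myers_semi_global("ACGTAC", "TTACGGACTT"))  # 1 (ACGGAC vs ACGTAC)
```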
14

Stromberg, Michael Peter. "Enabling high-throughput sequencing data analysis with MOSAIK." Thesis, Boston College, 2010. http://hdl.handle.net/2345/1332.

Abstract:
Thesis advisor: Gabor T. Marth
During the last few years, numerous new sequencing technologies have emerged that require tools that can process large amounts of read data quickly and accurately. Regardless of the downstream methods used, reference-guided aligners are at the heart of all next-generation analysis studies. I have developed a general reference-guided aligner, MOSAIK, that supports all current sequencing technologies (Roche 454, Illumina, Applied Biosystems SOLiD, Helicos, and Sanger capillary). The calibrated alignment qualities calculated by MOSAIK allow the user to fine-tune the alignment accuracy for a given study. MOSAIK is a highly configurable and easy-to-use suite of alignment tools that is used in hundreds of labs worldwide. MOSAIK is an integral part of our genetic variant discovery pipeline. From SNP and short-INDEL discovery to structural variation discovery, alignment accuracy is an essential requirement that enables our downstream analyses to provide accurate calls. In this thesis, I present three major studies that were formative during the development of MOSAIK and our analysis pipeline. In addition, I present a novel algorithm that identifies mobile element insertions (non-LTR retrotransposons) in the human genome using split-read alignments in MOSAIK. This algorithm has a low false discovery rate (4.4%) and enabled our group to be the first to determine the number of mobile elements that occur differentially between any two individuals.
Thesis (PhD) — Boston College, 2010
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology
15

Xing, Zhengrong. "Poisson multiscale methods for high-throughput sequencing data." Thesis, The University of Chicago, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10195268.

Abstract:

In this dissertation, we focus on the problem of analyzing data from high-throughput sequencing experiments. With the emergence of more capable hardware and more efficient software, these sequencing data provide information at an unprecedented resolution. However, statistical methods developed for such data rarely tackle the data at such high resolutions, and often make approximations that only hold under certain conditions.

We propose a model-based approach to dealing with such data, starting from a single sample. By taking into account the inherent structure present in such data, our model can accurately capture important genomic regions. We also present the model in such a way that makes it easily extensible to more complicated and biologically interesting scenarios.
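The abstract leaves the model unspecified; as a hedged sketch, Poisson multiscale models are typically built on the fact that independent Poisson counts condition to a binomial, which lets a count profile be decomposed scale by scale:

```latex
Y_1 \sim \mathrm{Poisson}(\lambda_1),\;
Y_2 \sim \mathrm{Poisson}(\lambda_2)
\;\Longrightarrow\;
Y_1 \mid (Y_1 + Y_2 = n) \;\sim\;
\mathrm{Binomial}\!\left(n,\ \frac{\lambda_1}{\lambda_1 + \lambda_2}\right)
```

Applied recursively to the two halves of each genomic interval, this turns a base-resolution profile into a total count plus a hierarchy of binomial proportions whose log-odds can be shrunk at each scale; the multi-sample extension discussed next can then test group covariate effects on those log-odds.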

Building upon the single-sample model, we then turn to the statistical question of detecting differences between multiple samples. Such questions often arise in the context of expression data, where much emphasis has been put on the problem of detecting differential expression between two groups. By extending the framework for a single sample to incorporate additional group covariates, our model provides a systematic approach to estimating and testing for such differences. We then apply our method to several empirical datasets, and discuss the potential for further applications to other biological tasks.

We also seek to address a different statistical question, where the goal here is to perform exploratory analysis to uncover hidden structure within the data. We incorporate the single-sample framework into a commonly used clustering scheme, and show that our enhanced clustering approach is superior to the original clustering approach in many ways. We then apply our clustering method to a few empirical datasets and discuss our findings.

Finally, we apply the shrinkage procedure used within the single-sample model to tackle a completely different statistical issue: nonparametric regression with heteroskedastic Gaussian noise. We propose an algorithm that accurately recovers both the mean and variance functions given a single set of observations, and demonstrate its advantages over state-of-the-art methods through extensive simulation studies.

16

Wang, Yuanyuan (Marcia). "Statistical Methods for High Throughput Screening Drug Discovery Data." Thesis, University of Waterloo, 2005. http://hdl.handle.net/10012/1204.

Abstract:
High Throughput Screening (HTS) is used in drug discovery to screen large numbers of compounds against a biological target. Data on activity against the target are collected for a representative sample of compounds selected from a large library. The goal of drug discovery is to relate the activity of a compound to its chemical structure, which is quantified by various explanatory variables, and hence to identify further active compounds. Often, this application has a very unbalanced class distribution, with a rare active class.

Classification methods are commonly proposed as solutions to this problem. However, regarding drug discovery, researchers are more interested in ranking compounds by predicted activity than in the classification itself. This feature makes my approach distinct from common classification techniques.

In this thesis, two AIDS data sets from the National Cancer Institute (NCI) are mainly used. Local methods, namely K-nearest neighbours (KNN) and classification and regression trees (CART), perform very well on these data in comparison with linear/logistic regression, neural networks, and Multivariate Adaptive Regression Splines (MARS) models, which assume more smoothness. One reason for the superiority of local methods is the local behaviour of the data. Indeed, I argue that conventional classification criteria such as misclassification rate or deviance tend to select too small a tree or too large a value of k (the number of nearest neighbours). A more local model (bigger tree or smaller k) gives a better performance in terms of drug discovery.

Because off-the-shelf KNN works relatively well, this thesis takes this promising method and makes several novel modifications, which further improve its performance. The choice of k is optimized for each test point to be predicted. The empirically observed superiority of allowing k to vary is investigated. The nature of the problem, ranking of objects rather than estimating the probability of activity, enables the k-varying algorithm to stand out. Similarly, KNN combined with a kernel weight function (weighted KNN) is proposed and demonstrated to be superior to the regular KNN method.
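As a rough sketch of the kernel-weighted variant (the kernel choice and parameters here are my assumptions, not the thesis's): each test compound is scored by the kernel-weighted fraction of active compounds among its k nearest neighbours, and the library is then ranked by that score rather than classified.

```python
import numpy as np

def weighted_knn_scores(X_train, y_train, X_test, k=25, bandwidth=1.0):
    """Gaussian-kernel-weighted KNN activity scores for ranking."""
    scores = np.empty(len(X_test))
    for i, x in enumerate(X_test):
        d = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(d)[:k]                   # k nearest neighbours
        w = np.exp(-(d[nn] / bandwidth) ** 2)    # closer => heavier weight
        scores[i] = np.sum(w * y_train[nn]) / np.sum(w)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                   # training descriptors
y = (rng.random(500) < 0.05).astype(float)       # rare active class
Xq = rng.normal(size=(20, 10))                   # untested compounds
ranking = np.argsort(-weighted_knn_scores(X, y, Xq))  # screen best-first
print(ranking[:5])
```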

High dimensionality of the explanatory variables is known to cause problems for KNN and many other classifiers. I propose a novel method (subset KNN) of averaging across multiple classifiers based on building classifiers on subspaces (subsets of variables). It improves the performance of KNN for HTS data. When applied to CART, it also performs as well as or even better than the popular methods of bagging and boosting. Part of this improvement is due to the discovery that classifiers based on irrelevant subspaces (unimportant explanatory variables) do little damage when averaged with good classifiers based on relevant subspaces (important variables). This result is particular to the ranking of objects rather than estimating the probability of activity. A theoretical justification is proposed. The thesis also suggests diagnostics for identifying important subsets of variables and hence further reducing the impact of the curse of dimensionality.

In order to have a broader evaluation of these methods, subset KNN and weighted KNN are applied to three other data sets: the NCI AIDS data with Constitutional descriptors, Mutagenicity data with BCUT descriptors and Mutagenicity data with Constitutional descriptors. The k-varying algorithm as a method for unbalanced data is also applied to NCI AIDS data with Constitutional descriptors. As a baseline, the performance of KNN on such data sets is reported. Although different methods are best for the different data sets, some of the proposed methods are always amongst the best.

Finally, methods are described for estimating activity rates and error rates in HTS data. By combining auxiliary information about repeat tests of the same compound, likelihood methods can extract interesting information about the magnitudes of the measurement errors made in the assay process. These estimates can be used to assess model performance, which sheds new light on how various models handle the large random or systematic assay errors often present in HTS data.
17

Yang, Yang. "Data mining support for high-throughput discovery of nanomaterials." Thesis, University of Leeds, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.577527.

Abstract:
Nanotechnology is becoming a promising technology due to its potential to dramatically improve the effectiveness of a number of existing consumer and industrial products, such as drug delivery systems, electronic circuits, catalysts and light-harvesting materials. However, the ability of industry and academia to accelerate the discovery of new nanomaterials is severely limited by the speed at which new compositions can be made and tested for suitable properties. A promising alternative approach for nanophotocatalyst discovery, currently under development at University College London (UCL) and the University of Leeds, utilizes recent advances in nanomaterials synthesis and automation to implement a high-throughput (HT) experimental system enabling rapid exploration of materials space. HT nanocatalyst discovery is an automated continuous process using hydrothermal synthesis, which can synthesize a large number of nanoparticles in a short time. The nanoparticles formulated are characterized and tested, on as many samples as possible, for indicative properties, rather than each sample being subjected to comprehensive characterisation. This thesis describes the development of chemometric and inductive data-mining tools to support the HT nanomaterial discovery process. The work is reported in five main parts: a data management system structured to reflect the HT workflow; an information flow management system designed for correspondence between UCL and Leeds; a prototype data mining system tailored for HT experimental data processing and analysis, together with its application; a new Design of Experiments (DoE) approach using a genetic algorithm, which was shown to handle variables effectively; and robust quantitative structure-activity relationship (QSAR) models with genetic parameter optimization for HT catalyst discovery. In contrast to their enormous benefits, nanoparticles also have adverse effects on the biological environment and human health. Owing to the diversity of nanoparticles, and the dependence of toxicity on their physico-chemical properties, it is not possible to test every particle. A rational way to proceed without testing every single nanoparticle and its variants is to relate the physico-chemical characteristics of nanoparticles to their toxicity in a QSAR model. This thesis also examines measurement methods for structural and compositional properties, including size distribution, surface area and morphological parameters, and assessment methods for the comparative toxicity of a panel of 18 nanoparticles chosen by the University of Edinburgh, the UBI Institute and the University of Leeds. The major contribution of this work is in using data mining methods to analyze toxicity together with measured structural and compositional properties to find relationships between the toxicity and the physico-chemical properties of nanoparticles. Structure-activity relationship (SAR) analysis focused on identifying the structural and compositional properties that may determine the cytotoxicity of nanoparticles.
18

Birchall, Kristian. "Reduced graph approaches to analysing high-throughput screening data." Thesis, University of Sheffield, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.443869.

19

Fritz, Markus Hsi-Yang. "Exploiting high throughput DNA sequencing data for genomic analysis." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610819.

20

Woolford, Julie Ruth. "Statistical analysis of small RNA high-throughput sequencing data." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610375.

21

Chen, Li. "Integrative Modeling and Analysis of High-throughput Biological Data." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/30192.

Abstract:
Computational biology is an interdisciplinary field that focuses on developing mathematical models and algorithms to interpret biological data so as to understand biological problems. With current high-throughput technology development, different types of biological data can be measured at large scale, which calls for more sophisticated computational methods to analyze and interpret the data. In this dissertation research work, we propose novel methods to integrate, model and analyze multiple biological data, including microarray gene expression data, protein-DNA interaction data and protein-protein interaction data. These methods will help improve our understanding of biological systems. First, we propose a knowledge-guided multi-scale independent component analysis (ICA) method for biomarker identification on time course microarray data. Guided by a knowledge gene pool related to a specific disease under study, the method can determine disease-relevant biological components from ICA modes and then identify biologically meaningful markers related to the specific disease. We have applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification. Second, we propose a novel method for transcriptional regulatory network identification by integrating gene expression data and protein-DNA binding data. The approach is built upon a multi-level analysis strategy designed for suppressing false positive predictions. With this strategy, a regulatory module becomes increasingly significant as more relevant gene sets are formed at finer levels. At each level, a two-stage support vector regression (SVR) method is utilized to reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure then assesses the significance of each regulatory module. The resulting performance on simulation data and yeast cell cycle data shows that the multi-level SVR approach outperforms other existing methods in the identification of both regulators and their target genes. We have further applied the proposed method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer. Third, we propose a bootstrapping Markov Random Field (MRF)-based method for subnetwork identification on microarray data by incorporating protein-protein interaction data. Methodologically, an MRF-based network score is first derived by considering the dependency among genes to increase the chance of selecting hub genes. A modified simulated annealing search algorithm is then utilized to find the optimal/suboptimal subnetworks with maximal network score. A bootstrapping scheme is finally implemented to generate confident subnetworks. Experimentally, we have compared the proposed method with other existing methods, and the resulting performance on simulation data shows that the bootstrapping MRF-based method outperforms other methods in identifying ground-truth subnetworks and hub genes. We have then applied our method to breast cancer data to identify significant subnetworks associated with drug resistance. The identified subnetworks not only show good reproducibility across different data sets, but also indicate several pathways and biological functions potentially associated with the development of breast cancer and drug resistance. In addition, we propose to develop network-constrained support vector machines (SVMs) for cancer classification and prediction, taking into account the network structure to construct classification hyperplanes. The simulation study demonstrates the effectiveness of our proposed method. The study on the real microarray data sets shows that our network-constrained SVM, together with the bootstrapping MRF-based subnetwork identification approach, can achieve better classification performance than conventional biomarker selection approaches and SVMs. We believe that the research presented in this dissertation not only provides novel and effective methods to model and analyze different types of biological data, but also, through extensive experiments on several real microarray data sets, shows the potential to improve our understanding of the biological mechanisms related to cancers by generating novel hypotheses for further study.
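As a rough illustration of the simulated-annealing search mentioned above (a toy sketch only, not the author's implementation: the node scores and additive network score below are invented stand-ins for the MRF-based score):

    import math
    import random

    # Toy simulated-annealing search for a high-scoring subnetwork: toggle
    # one gene in or out of the current set, and accept worse moves with a
    # temperature-dependent probability that shrinks as the search cools.
    random.seed(0)
    genes = list(range(30))
    node_score = {g: random.gauss(0, 1) for g in genes}  # stand-in scores

    def network_score(sub):
        # Placeholder for the MRF-based network score in the dissertation.
        return sum(node_score[g] for g in sub)

    current = set(random.sample(genes, 5))
    temp = 2.0
    for _ in range(2000):
        candidate = set(current)
        candidate.symmetric_difference_update({random.choice(genes)})
        delta = network_score(candidate) - network_score(current)
        if delta > 0 or random.random() < math.exp(delta / temp):
            current = candidate
        temp = max(0.01, temp * 0.999)  # cooling schedule
    print(sorted(current))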
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
22

Kannan, Anusha Aiyalu. "Detecting relevant changes in high throughput gene expression data /." Online version of thesis, 2008. http://hdl.handle.net/1850/10832.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Shan, Jing. "High-Throughput Identification of Molecular Factors that Promote Phenotypic Stabilization of Primary Human Hepatocytes in vitro." Thesis, Harvard University, 2016. http://nrs.harvard.edu/urn-3:HUL.InstRepos:27007729.

Full text
Abstract:
Liver disease is a leading cause of morbidity worldwide and treatment options are limited, with organ transplantation being the only form of definitive management. Cell-based therapies have long held promise as alternatives to whole-organ transplantation, but their development has been hindered by the rapid loss of liver-specific functions in cultured hepatocytes. The overall goal of this thesis is to systematically identify genetic factors involved in hepatocyte phenotype maintenance in vitro in order to help generate a source of functional human hepatocytes for studying liver biology and treating liver disease. It is our hypothesis that molecular signals from the stroma provide inductive cues to maintain liver phenotype and that these stromal signals can be isolated and used to stabilize primary human hepatocytes in vitro. We report here the development of a high-throughput human liver model and attendant automatable assays capable of reflecting human liver physiology. These tools were used to conduct genetic knockdown screens of over 450 stromal factors in over 5000 two-way combinations in order to identify molecules important for hepatocyte functions. Results suggest that multiple signaling molecules are involved in stromal-mediated stabilization of hepatocytes ex vivo. Adsorption of hit molecules such as Activin A onto tissue culture plastic improved hepatocyte survival and morphology, and may act via signaling pathways that inhibit cell cycle progression and apoptosis. These results represent important first steps in the elucidation of mechanisms instrumental to the functional maintenance of hepatocytes in vitro, and we hope this new insight will guide the assembly of a cocktail of recombinant acellular stromal products capable of replacing stromal cells in hepatic tissue engineering applications.
APA, Harvard, Vancouver, ISO, and other styles
24

Šupraha, Luka. "Phenotypic evolution and adaptive strategies in marine phytoplankton (Coccolithophores)." Doctoral thesis, Uppsala universitet, Paleobiologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-302903.

Full text
Abstract:
Coccolithophores are biogeochemically important marine algae that interact with the carbon cycle through photosynthesis (CO2 sink), calcification (CO2 source) and burial of carbon into oceanic sediments. The group is considered susceptible to the ongoing climate perturbations, in particular to ocean acidification, temperature increase and nutrient limitation. The aim of this thesis was to investigate the adaptation of coccolithophores to environmental change, with a focus on temperature stress and nutrient limitation. The research was conducted within the framework of three approaches: experiments testing the physiological response of the coccolithophore species Helicosphaera carteri and Coccolithus pelagicus to phosphorus limitation, field studies on coccolithophore life cycles with a method comparison, and an investigation of the phenotypic evolution of the coccolithophore genus Helicosphaera over the past 15 Ma. Experimental results show that the physiology and morphology of large coccolithophores are sensitive to phosphorus limitation, and that adaptation to low-nutrient conditions can lead to a decrease in calcification rates. Field studies have contributed to our understanding of coccolithophore life cycles, revealing complex ecological patterns within the Mediterranean community which are seemingly regulated by seasonal, temperature-driven environmental changes. In addition, the high-throughput sequencing (HTS) molecular method was shown to provide a good overall representation of coccolithophore community composition. Finally, the study of Helicosphaera evolution showed that adaptation to decreasing CO2 in higher latitudes involved cell and coccolith size decrease, whereas the adaptation in tropical ecosystems also included a physiological decrease in calcification rates in response to nutrient limitation. This thesis advances our understanding of coccolithophore adaptive strategies and will improve our predictions of the fate of the group under ongoing climate change.
APA, Harvard, Vancouver, ISO, and other styles
25

Lu, Feng. "Big data scalability for high throughput processing and analysis of vehicle engineering data." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-207084.

Full text
Abstract:
"Sympathy for Data" is a platform that is utilized for Big Data automation analytics. It is based on visual interface and workflow configurations. The main purpose of the platform is to reuse parts of code for structured analysis of vehicle engineering data. However, there are some performance issues on a single machine for processing a large amount of data in Sympathy for Data. There are also disk and CPU IO intensive issues when the data is oversized and the platform need fits comfortably in memory. In addition, for data over the TB or PB level, the Sympathy for data needs separate functionality for efficient processing simultaneously and scalable for distributed computation functionality. This paper focuses on exploring the possibilities and limitations in using the Sympathy for Data platform in various data analytic scenarios within the Volvo Cars vision and strategy. This project re-writes the CDE workflow for over 300 nodes into pure Python script code and make it executable on the Apache Spark and Dask infrastructure. We explore and compare both distributed computing frameworks implemented on Amazon Web Service EC2 used for 4 machine with a 4x type for distributed cluster measurement. However, the benchmark results show that Spark is superior to Dask from performance perspective. Apache Spark and Dask will combine with Sympathy for Data products for a Big Data processing engine to optimize the system disk and CPU IO utilization. There are several challenges when using Spark and Dask to analyze large-scale scientific data on systems. For instance, parallel file systems are shared among all computing machines, in contrast to shared-nothing architectures. Moreover, accessing data stored in commonly used scientific data formats, such as HDF5 is not tentatively supported in Spark. This report presents research carried out on the next generation of Big Data platforms in the automotive industry called "Sympathy for Data". The research questions focusing on improving the I/O performance and scalable distributed function to promote Big Data analytics. During this project, we used the Dask.Array parallelism features for interpretation the data sources as a raster shows in table format, and Apache Spark used as data processing engine for parallelism to load data sources to memory for improving the big data computation capacity. The experiments chapter will demonstrate 640GB of engineering data benchmark for single node and distributed computation mode to evaluate the Sympathy for Data Disk CPU and memory metrics. Finally, the outcome of this project improved the six times performance of the original Sympathy for data by developing a middleware SparkImporter. It is used in Sympathy for Data for distributed computation and connected to the Apache Spark for data processing through the maximum utilization of the system resources. This improves its throughput, scalability, and performance. It also increases the capacity of the Sympathy for data to process Big Data and avoids big data cluster infrastructures.
APA, Harvard, Vancouver, ISO, and other styles
26

Zandegiacomo, Cella Alice. "Multiplex network analysis with application to biological high-throughput data." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10495/.

Full text
Abstract:
This thesis studies several properties of multiplex networks; in particular, the analysis focuses on quantifying the differences between the layers of a multiplex. Dissimilarities are evaluated both by observing the connections of single nodes in different layers and by estimating the different partitions of the layers. Several important measures for characterizing multiplexes are introduced and then used to construct community-detection methods. The difference between the partitions of two layers is quantified using a mutual information measure. The use of the hypergeometric test for identifying nodes over-represented in a layer is also examined, showing the effectiveness of the test as a function of layer similarity. These methods for characterizing the properties of multiplex networks are applied to real biological data. The data were collected in the DILGOM study, whose aim was to determine the genetic, transcriptomic and metabolic implications of obesity and metabolic syndrome, and are used by the Mimomics project to determine relationships among different omics. In this thesis the metabolic data are analyzed with a multiplex-network approach to test for differences between the relationships of blood compounds in obese and normal-weight individuals.
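The two comparison tools named in the abstract, mutual information between layer partitions and the hypergeometric test for over-represented nodes, are standard and easy to sketch (the labels and counts below are invented examples):

    from scipy.stats import hypergeom
    from sklearn.metrics import normalized_mutual_info_score

    # Similarity of two layers' community partitions over the same nodes.
    layer1 = [0, 0, 0, 1, 1, 2, 2, 2]
    layer2 = [0, 0, 1, 1, 1, 2, 2, 0]
    print(normalized_mutual_info_score(layer1, layer2))

    # Hypergeometric over-representation: probability of seeing 5 or more
    # "successes" in 12 draws from a population of 100 containing 10.
    print(hypergeom.sf(4, 100, 10, 12))  # sf(k-1, N, K, n) = P(X >= k)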
APA, Harvard, Vancouver, ISO, and other styles
27

Bleuler, Stefan. "Search heuristics for module identification from biological high-throughput data /." [S.l.] : [s.n.], 2008. http://e-collection.ethbib.ethz.ch/show?type=diss&nr=17386.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Kircher, Martin. "Understanding and improving high-throughput sequencing data production and analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-71102.

Full text
Abstract:
Advances in DNA sequencing revolutionized the field of genomics over the last 5 years. New sequencing instruments make it possible to rapidly generate large amounts of sequence data at substantially lower cost. These high-throughput sequencing technologies (e.g. Roche 454 FLX, Life Technologies SOLiD, Dover Polonator, Helicos HeliScope and Illumina Genome Analyzer) make whole genome sequencing and resequencing, transcript sequencing as well as quantification of gene expression, DNA-protein interactions and DNA methylation feasible at an unanticipated scale. In the field of evolutionary genomics, high-throughput sequencing permitted studies of whole genomes from ancient specimens of different hominin groups. Further, it allowed large-scale population genetics studies of present-day humans as well as different types of sequence-based comparative genomics studies in primates. Such comparisons of humans with closely related apes and hominins are important not only to better understand human origins and the biological background of what sets humans apart from other organisms, but also for understanding the molecular basis for diseases and disorders, particularly those that affect uniquely human traits, such as speech disorders, autism or schizophrenia. However, while the cost and time required to create comparative data sets have been greatly reduced, the error profiles and limitations of the new platforms differ significantly from those of previous approaches. This requires a specific experimental design in order to circumvent these issues, or to handle them during data analysis. During the course of my PhD, I analyzed and improved current protocols and algorithms for next generation sequencing data, taking into account the specific characteristics of these new sequencing technologies. The presented approaches and algorithms were applied in different projects and are widely used within the department of Evolutionary Genetics at the Max Planck Institute of Evolutionary Anthropology. In this thesis, I will present selected analyses from the whole genome shotgun sequencing of two ancient hominins and the quantification of gene expression from short-sequence tags in five tissues from three primates.
APA, Harvard, Vancouver, ISO, and other styles
29

Mohamadi, Hamid. "Parallel algorithms and software tools for high-throughput sequencing data." Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/62072.

Full text
Abstract:
With growing throughput and dropping cost of High-Throughput Sequencing (HTS) technologies, there is a continued need to develop faster and more cost-effective bioinformatics solutions. However, the algorithms and computational power required to efficiently analyze HTS data have lagged considerably. In health and life sciences research organizations, de novo assembly and sequence alignment have become two key steps in everyday research and analysis. The de novo assembly process is a fundamental step in analyzing previously uncharacterized organisms and is one of the most computationally demanding problems in bioinformatics. Sequence alignment is a fundamental operation in a broad spectrum of genomics projects: in genome resequencing projects, alignments are often used prior to variant calling; in transcriptome resequencing, they provide information on gene expression; and they are even used in de novo sequencing projects to help contiguate assembled sequences. As such, designing efficient, scalable, and accurate solutions for the de novo assembly and sequence alignment problems would have a wide effect on the field. In this thesis, I present a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures. I also utilize the latest advances in parallel and distributed computing to design and develop scalable and cost-effective algorithms on High-Performance Computing (HPC) infrastructures, especially for the de novo assembly and sequence alignment problems. The algorithms and software solutions I develop are publicly available free for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide.
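As a flavor of the k-mer-centric processing that assembly and alignment tools build on (a generic sketch, not the thesis software itself; the reads are made up):

    from collections import Counter

    def canonical_kmers(seq, k):
        # Enumerate k-mers and their reverse complements; keep the
        # lexicographically smaller ("canonical") form, as assemblers
        # commonly do so that both strands map to one key.
        comp = str.maketrans("ACGT", "TGCA")
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            rc = kmer.translate(comp)[::-1]
            yield min(kmer, rc)

    counts = Counter()
    for read in ["ACGTACGTAC", "GTACGTACGT"]:
        counts.update(canonical_kmers(read, k=5))
    print(counts.most_common(3))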
Science, Faculty of
Graduate
APA, Harvard, Vancouver, ISO, and other styles
30

Gustafsson, Mika. "Gene networks from high-throughput data : Reverse engineering and analysis." Doctoral thesis, Linköpings universitet, Kommunikations- och transportsystem, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-54089.

Full text
Abstract:
Experimental innovations starting in the 1990s, leading to the advent of high-throughput experiments in cellular biology, have made it possible to measure thousands of genes simultaneously at a modest cost. This enables the discovery of new, unexpected relationships between genes, in addition to the possibility of falsifying existing ones. To benefit as much as possible from these experiments, the new interdisciplinary research field of systems biology has materialized. Systems biology goes beyond the conventional reductionist approach and aims at learning the whole system, under the assumption that the system is greater than the sum of its parts. One emerging enterprise in systems biology is to use high-throughput data to reverse engineer the web of gene regulatory interactions governing cellular dynamics. This relatively new endeavor goes further than clustering genes with similar expression patterns and requires separating the causes of gene expression from the effects. Despite the rapid increase in data, we then face the problem of having too few experiments to determine which regulations are active, as the number of putative interactions has increased dramatically with the number of units in the system. One possibility to overcome this problem is to impose more biologically motivated constraints. However, what is or is not a biological fact is often not obvious and may be condition dependent. Moreover, investigations have suggested several statistical facts about gene regulatory networks, which motivate the development of new reverse engineering algorithms relying on different model assumptions. As a result, numerous new reverse engineering algorithms for gene regulatory networks have been proposed. As a consequence, interest has grown in the community in assessing the performance of different attempts in fair trials on “real” biological problems. This resulted in the annually held DREAM conference, which contains computational challenges that participating researchers solve directly and that the chairs of the conference evaluate after the submission deadline. This thesis contains the evolution of regularization schemes to reverse engineer gene networks from high-throughput data within the framework of ordinary differential equations. Furthermore, since the goal is to understand gene networks, a substantial part of the thesis also concerns statistical analysis of gene networks. First, we reverse engineer a genome-wide regulatory network based solely on microarray data, utilizing an extremely simple strategy assuming sparseness (LASSO). To validate and analyze this network we also develop some statistical tools. Then we present a refinement of the initial strategy, which is the algorithm with which we achieved best performer at the DREAM2 conference. This strategy is further refined into a reverse engineering scheme that can also include external high-throughput data, which we confirm to be of relevance as we achieved best performer in the DREAM3 conference as well. Finally, the tools we developed to analyze stability and flexibility in linearized ordinary differential equations representing gene regulatory networks are further discussed.
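The sparse (LASSO) strategy described above can be sketched generically: regress each gene on all others and keep nonzero coefficients as candidate regulatory links (random stand-in data; the regularization strength is an arbitrary choice, not the thesis setting):

    import numpy as np
    from sklearn.linear_model import Lasso

    # Minimal sketch of sparse network inference: regress each gene's
    # expression on all other genes; nonzero coefficients are putative
    # regulatory links. X is a hypothetical samples-by-genes matrix.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 20))

    edges = []
    for j in range(X.shape[1]):
        targets = X[:, j]
        predictors = np.delete(X, j, axis=1)
        model = Lasso(alpha=0.1).fit(predictors, targets)
        for i, w in zip(np.delete(np.arange(X.shape[1]), j), model.coef_):
            if w != 0.0:
                edges.append((i, j, w))  # regulator i -> target j
    print(len(edges), "candidate links")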
APA, Harvard, Vancouver, ISO, and other styles
31

Cunningham, Gordon John. "Application of cluster analysis to high-throughput multiple data types." Thesis, University of Glasgow, 2011. http://theses.gla.ac.uk/2715/.

Full text
Abstract:
PolySNAP is a program used for analysis of high-throughput powder diffraction data. The program matches diffraction patterns using Pearson and Spearman correlation coefficients to measure the similarity of the profile of each pattern with every other pattern, which creates a correlation matrix. This correlation matrix is then used to partition the patterns into groups using a variety of cluster analysis methods. The original version could not handle any data types other than powder X-ray diffraction. The aim of this project was to expand the methods used in PolySNAP to allow it to analyse other data types, in particular Raman spectroscopy, differential scanning calorimetry and infrared spectroscopy data. This involved the preparation of suitable compounds that could be analysed using these techniques. The main compounds studied were sulfathiazole, carbamazepine and piroxicam. Some additional studies were carried out on other datasets, including a blind test on an unseen dataset to assess the efficacy of the methods. The optimal method for clustering an unknown dataset has also been determined.
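The correlation-then-cluster pipeline is straightforward to sketch (random stand-ins for diffraction or spectroscopy profiles; not the PolySNAP code itself):

    import numpy as np
    from scipy.stats import spearmanr
    from scipy.cluster.hierarchy import linkage, fcluster

    # Sketch: cluster measured profiles (rows) by rank correlation,
    # loosely in the spirit of PolySNAP's similarity matrix.
    rng = np.random.default_rng(1)
    patterns = rng.normal(size=(12, 200))

    rho, _ = spearmanr(patterns, axis=1)   # 12 x 12 similarity matrix
    dist = 1.0 - rho                       # turn correlation into distance
    Z = linkage(dist[np.triu_indices(12, k=1)], method="average")
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels)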
APA, Harvard, Vancouver, ISO, and other styles
32

Ainsworth, David. "Computational approaches for metagenomic analysis of high-throughput sequencing data." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/44070.

Full text
Abstract:
High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. This ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented insights into diverse microbial communities. These technologies however are not without their limitations: the short length of the reads requires the production of vast amounts of data to ensure all information is captured. This 'data deluge' has been a major bottleneck and has necessitated the development of new algorithms for analysis. Sequence alignment methods provide the most information about the composition of a sample as they allow both taxonomic and functional classification, but the algorithms are prohibitively slow. This inefficiency has led to the reliance on faster algorithms which only produce simple taxonomic classification or abundance estimation, losing the valuable information given by full alignments against annotated genomes. This thesis will describe k-SLAM, a novel ultra-fast method for the alignment and taxonomic classification of metagenomic data. Using a k-mer based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment based approaches, allowing a full taxonomic classification and gene identification to be tractable on modern large datasets. The alignments found by k-SLAM can also be used to find variants and identify genes, along with their nearest taxonomic origins. A novel pseudo-assembly method produces more specific taxonomic classifications on species which have high sequence identity within their genus. This provides a significant (up to 40%) increase in accuracy on these species. Also described is a re-analysis of a Shiga-toxin-producing E. coli O104:H4 isolate via alignment against bacterial and viral species to find antibiotic resistance and toxin-producing genes. k-SLAM has been used by a range of research projects including FLORINASH and is currently being used by a number of groups.
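A toy version of k-mer-based taxonomic classification conveys the idea behind tools of this kind (the reference sequences and read are invented; k-SLAM's actual algorithm is far more sophisticated):

    from collections import Counter

    # Each read is assigned to the reference taxon sharing the most
    # k-mers with it; ties and zero-hit reads are left unclassified.
    def kmers(seq, k=4):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    reference = {
        "taxonA": kmers("ACGTACGTTGCA"),
        "taxonB": kmers("TTTTGGGGCCCC"),
    }

    def classify(read, k=4):
        scores = Counter({t: len(kmers(read, k) & ks) for t, ks in reference.items()})
        taxon, hits = scores.most_common(1)[0]
        return taxon if hits > 0 else None

    print(classify("ACGTACGT"))  # -> taxonA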
APA, Harvard, Vancouver, ISO, and other styles
33

Lasher, Christopher Donald. "Discovering contextual connections between biological processes using high-throughput data." Diss., Virginia Tech, 2011. http://hdl.handle.net/10919/77217.

Full text
Abstract:
Hearkening to calls from life scientists for aid in interpreting rapidly-growing repositories of data, the fields of bioinformatics and computational systems biology continue to bear increasingly sophisticated methods capable of summarizing and distilling pertinent phenomena captured by high-throughput experiments. Techniques in analysis of genome-wide gene expression (e.g., microarray) data, for example, have moved beyond simply detecting individual genes perturbed in treatment-control experiments to reporting the collective perturbation of biologically-related collections of genes, or "processes". Recent expression analysis methods have focused on improving the comprehensibility of results by reporting concise, non-redundant sets of processes, leveraging statistical modeling techniques such as Bayesian networks. Simultaneously, integrating gene expression measurements with gene interaction networks has led to computation of response networks--subgraphs of interaction networks in which genes exhibit strong collective perturbation or co-expression. Methods that integrate process annotations of genes with interaction networks identify high-level connections between biological processes themselves. To identify context-specific changes in these inter-process connections, however, new techniques proved necessary: process-based expression analysis reports only perturbed processes and not their relationships; response networks are composed of interactions between genes rather than processes; and existing techniques for detecting process connections do not incorporate specific biological context. We present two novel methods which take inspiration from the latest techniques in process-based gene expression analysis, computation of response networks, and computation of inter-process connections. We motivate the need for detecting inter-process connections by identifying a collection of processes exhibiting significant differences in collective expression in two liver tissue culture systems widely used in toxicological and pharmaceutical assays. Next, we identify perturbed connections between these processes via a novel method that integrates gene expression, interaction, and annotation data. Finally, we present another novel method that computes non-redundant sets of perturbed inter-process connections, and apply it to several additional liver-related data sets. These applications demonstrate the ability of our methods to capture and report biologically relevant high-level trends.
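One plausible way to score a connection between two processes, offered in the hedged spirit of the integration described above (all names and data are invented; this is not the author's actual scoring function):

    import numpy as np

    # Score an inter-process connection as the mean absolute
    # co-expression of interacting gene pairs straddling the processes.
    rng = np.random.default_rng(2)
    expr = {g: rng.normal(size=30) for g in "ABCDEF"}   # gene -> profile
    process1, process2 = {"A", "B", "C"}, {"D", "E", "F"}
    interactions = [("A", "D"), ("B", "E"), ("C", "F")]

    scores = [
        abs(np.corrcoef(expr[g1], expr[g2])[0, 1])
        for g1, g2 in interactions
        if (g1 in process1 and g2 in process2) or (g1 in process2 and g2 in process1)
    ]
    print("connection score:", np.mean(scores))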
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
34

Doineau, Raphaël. "Development of droplet-based microfluidic technology for high-throughput single-cell phenotypic screening of B cell repertoires." Thesis, Sorbonne Paris Cité, 2017. http://www.theses.fr/2017USPCC263/document.

Full text
Abstract:
The adaptive immune system plays a leading role in defense against infection. The humoral response, involving the production of antibodies, is an important component of the adaptive immune response. During an infection, specific B cells of the immune system proliferate and release large amounts of antibodies which bind selectively to the target protein (antigen) found on the invading pathogen, inducing destruction of the pathogen. However, the immune system does not always respond efficiently enough to destroy pathogens, and tolerance mechanisms prevent the generation of antibodies against human proteins - such as cell surface markers on cancer cells or cytokines involved in inflammatory and autoimmune disease - that could be important therapeutic targets. Hence, there is great interest in research and development of specific antibodies that can be used for immunotherapy of patients. Due to their high affinity and selective binding to antigens, monoclonal antibodies (mAbs) have emerged as powerful therapeutic agents. Monoclonal antibodies derived from single B cells have a unique sequence and display binding affinity for a specific antigen. However, until now, the discovery of mAbs has been limited by the lack of high-throughput methods for the direct and large-scale screening of non-immortalized primary B cells to uncover rare B cells which produce the specific antibodies of clinical interest. This is now becoming possible with the emergence and improvement of in vitro compartmentalization methods for single-cell encapsulation and screening in picoliter droplets. In my PhD project, I describe the development of binding immunoassays and microfluidic devices for the direct phenotypic screening of single cells from enriched B cell populations. This development has enabled detailed analysis of the humoral immune response with single-cell resolution, and is an essential component of an antibody-discovery pipeline coupling single-cell phenotypic screening to single-cell antibody sequencing. It is now possible, for the first time, to screen millions of single B cells based on the binding activity of the secreted antibodies and to recover the antibody sequences.
APA, Harvard, Vancouver, ISO, and other styles
35

Hänzelmann, Sonja 1981. "Pathway-centric approaches to the analysis of high-throughput genomics data." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/108337.

Full text
Abstract:
In the last decade, molecular biology has expanded from a reductionist view to a systems-wide view that tries to unravel the complex interactions of cellular components. Owing to the emergence of high-throughput technology, it is now possible to interrogate entire genomes at an unprecedented resolution. The dimension and unstructured nature of these data made it evident that new methodologies and tools are needed to turn data into biological knowledge. To contribute to this challenge we exploited the wealth of publicly available high-throughput genomics data and developed bioinformatics methodologies focused on extracting information at the pathway rather than the single-gene level. First, we developed Gene Set Variation Analysis (GSVA), a method that facilitates the organization and condensation of gene expression profiles into gene sets. GSVA enables pathway-centric downstream analyses of microarray and RNA-seq gene expression data. The method estimates sample-wise pathway variation over a population and allows for the integration of heterogeneous biological data sources with pathway-level expression measurements. To illustrate the features of GSVA, we applied it to several use-cases employing different data types and addressing biological questions. GSVA is made available as an R package within the Bioconductor project. Secondly, we developed a pathway-centric genome-based strategy to reposition drugs in type 2 diabetes (T2D). This strategy consists of two steps: first, a regulatory network is constructed and used to identify disease-driving modules; then these modules are searched for compounds that might target them. Our strategy is motivated by the observation that disease genes tend to group together in the same neighborhood forming disease modules, and that multiple genes might have to be targeted simultaneously to attain an effect on the pathophenotype. To find potential compounds, we used compound-exposure genomics data deposited in public databases. We collected about 20,000 samples that have been exposed to about 1,800 compounds. Gene expression can be seen as an intermediate phenotype reflecting underlying dysregulated pathways in a disease. Hence, genes contained in the disease modules that elicit similar transcriptional responses upon compound exposure are assumed to point to a potential therapeutic effect. We applied the strategy to gene expression data of human islets from diabetic and healthy individuals and identified four potential compounds, methimazole, pantoprazole, bitter orange extract and torcetrapib, that might have a positive effect on insulin secretion. This is the first time a regulatory network of human islets has been used to reposition compounds for T2D. In conclusion, this thesis contributes two pathway-centric approaches to important bioinformatics problems: the assessment of biological function and in silico drug repositioning. These contributions demonstrate the central role of pathway-based analyses in interpreting high-throughput genomics data.
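GSVA itself is an R/Bioconductor package; as a language-neutral illustration of the pathway-centric idea, here is a simplified rank-based sample-wise gene-set score in Python (not the actual GSVA algorithm; the expression matrix and gene-set members are invented):

    import numpy as np
    from scipy.stats import rankdata

    # For each sample, rank all genes and average the ranks of the genes
    # in a set, giving one pathway-level score per sample.
    rng = np.random.default_rng(3)
    expr = rng.normal(size=(1000, 8))          # genes x samples, made up
    gene_set = [10, 42, 77, 300, 512]          # hypothetical pathway members

    ranks = np.apply_along_axis(rankdata, 0, expr)   # rank genes per sample
    score = ranks[gene_set, :].mean(axis=0) / expr.shape[0]
    print(np.round(score, 3))                  # one score per sample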
APA, Harvard, Vancouver, ISO, and other styles
36

Swamy, Sajani. "The automation of glycopeptide discovery in high throughput MS/MS data." Thesis, Waterloo, Ont. : University of Waterloo, 2004. http://etd.uwaterloo.ca/etd/sswamy2004.pdf.

Full text
Abstract:
Thesis (MMath)--University of Waterloo, 2004.
"A thesis presented to the University of Waterloo in fulfilment of the thesis requirement for the degree of Master of Mathematics in Computer Science." Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
37

Mammana, Alessandro [Verfasser]. "Patterns and algorithms in high-throughput sequencing count data / Alessandro Mammana." Berlin : Freie Universität Berlin, 2016. http://d-nb.info/1108270956/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Love, Michael I. [Verfasser]. "Statistical analysis of high-throughput sequencing count data / Michael I. Love." Berlin : Freie Universität Berlin, 2013. http://d-nb.info/1043197842/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Kostadinov, Ivaylo [Verfasser]. "Marine Metagenomics: From high-throughput data to ecogenomic interpretation / Ivaylo Kostadinov." Bremen : IRC-Library, Information Resource Center der Jacobs University Bremen, 2012. http://d-nb.info/1035211564/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Bao, Suying, and 鲍素莹. "Deciphering the mechanisms of genetic disorders by high throughput genomic data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2013. http://hdl.handle.net/10722/196471.

Full text
Abstract:
A new generation of non-Sanger-based sequencing technologies, so-called “next-generation” sequencing (NGS), has been changing the landscape of genetics at unprecedented speed. In particular, our capacity for deciphering the genotypes underlying phenotypes, such as diseases, has never been greater. However, before fully applying NGS in medical genetics, researchers have to bridge the widening gap between the generation of massively parallel sequencing output and the capacity to analyze the resulting data. In addition, even when a list of candidate genes with potential causal variants can be obtained from an effective NGS analysis, pinpointing disease genes within the long list remains a challenge. The issue becomes especially difficult when the molecular basis of the disease is not fully elucidated. New NGS users are often bewildered by a plethora of options in mapping, assembly, variant calling and filtering programs, and may have no idea how to compare these tools and choose the “right” ones. To get an overview of various bioinformatics attempts at mapping and assembly, a series of performance evaluations was conducted using both real and simulated NGS short reads. For NGS variant detection, the performance of the two most widely used toolkits was assessed, namely SAMtools and GATK. Based on the results of this systematic evaluation, an NGS data processing and analysis pipeline was constructed, and the pipeline proved a success with the identification of a mutation (a frameshift deletion in Hnrnpa1, p.Leu181Valfs*6) related to congenital heart defects (CHD) in procollagen type IIA-deficient mice. In order to prioritize risk genes for diseases, especially those with limited prior knowledge, a network-based gene prioritization model was constructed. It consists of two parts: network analysis on known disease genes (seed-based network strategy) and network analysis on differential expression (DE-based network strategy). Case studies of various complex diseases/traits demonstrated that the DE-based network strategy can greatly outperform traditional gene expression analysis in predicting disease-causing genes. A series of simulations indicated that the DE-based strategy is especially meaningful for diseases with limited prior knowledge, and that the model's performance can be further advanced by integration with the seed-based network strategy. Moreover, a successful application of the network-based gene prioritization model in an influenza host genetic study further demonstrated the capacity of the model to identify promising candidates and to mine new risk genes and pathways not biased toward our current knowledge. In conclusion, an efficient NGS analysis framework, from the steps of quality control and variant detection to those of result analysis and gene prioritization, has been constructed for medical genetics. The novelty in this framework is an encouraging attempt to prioritize risk genes for poorly characterized diseases by network analysis on known disease genes and differential expression data. The successful applications in detecting genetic factors associated with CHD and influenza host resistance demonstrate the efficacy of this framework, and may further stimulate applications of high-throughput genomic data in dissecting the genetic components of human disorders in the near future.
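A seed-based network strategy can be illustrated with personalized PageRank over an interaction network (gene names and edges are invented; the thesis model is more elaborate):

    import networkx as nx

    # Rank candidate genes by personalized PageRank restarted from
    # known disease genes ("seeds"); higher scores sit closer to seeds.
    G = nx.Graph([("TP53", "MDM2"), ("MDM2", "CDKN1A"),
                  ("TP53", "BRCA1"), ("BRCA1", "BARD1")])
    seeds = {"TP53": 1.0}                     # known disease gene(s)

    scores = nx.pagerank(G, personalization=seeds)
    for gene, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{gene}\t{s:.3f}")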
published_or_final_version
Biochemistry
Doctoral
Doctor of Philosophy
APA, Harvard, Vancouver, ISO, and other styles
41

Wang, Dao Sen. "Conditional Differential Expression for Biomarker Discovery In High-throughput Cancer Data." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/38819.

Full text
Abstract:
Biomarkers have important clinical uses as diagnostic, prognostic, and predictive tools for cancer therapy. However, translation of biomarkers claimed in the literature to clinical use has traditionally been poor. Clinical covariates have been shown to be important factors in biomarker discovery in small-scale studies, yet traditional differential gene expression analysis for expression biomarkers ignores covariates, which are only accounted for later, if at all. We conjecture that covariate-sensitive biomarker identification should lead to the discovery of more robust and true biomarkers, as confounding effects are considered. Here we examine gene expression in more than 750 breast invasive ductal carcinoma cases from The Cancer Genome Atlas (TCGA-BRCA) in the form of RNA-Seq data. Specifically, we focus on differential gene expression with respect to understanding HER2, ER, and PR biology – the three key receptors in breast cancer. We explore methods of differential expression analysis, including non-parametric Mann-Whitney-Wilcoxon analysis, generalized linear models with covariates, and a novel categorical method for covariates. We tested the influence of common patient characteristics, such as age and race, and clinical covariates such as HER2, ER, and PR receptor statuses. More importantly, we show that inclusion of a correlated covariate (e.g. PR status as a covariate in ER analysis) substantially changes the list of differentially expressed genes, removing many likely false positives and revealing genes obscured by the covariate. Incorporation of relevant covariates in differential gene expression analysis holds strong biological importance for biomarker discovery and may be the next step towards better translation of biomarkers to clinical use.
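The covariate-adjusted analysis described above can be sketched for a single gene with an ordinary linear model (a simple stand-in for the generalized linear models used in the thesis; the data frame is random):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Model log-expression as a function of ER status while adjusting
    # for PR status; the frame is a random stand-in for TCGA-BRCA.
    rng = np.random.default_rng(4)
    n = 200
    df = pd.DataFrame({
        "log_expr": rng.normal(size=n),
        "ER": rng.integers(0, 2, size=n),
        "PR": rng.integers(0, 2, size=n),
    })

    fit = smf.ols("log_expr ~ ER + PR", data=df).fit()
    print(fit.params["ER"], fit.pvalues["ER"])  # ER effect adjusted for PR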
APA, Harvard, Vancouver, ISO, and other styles
42

Cao, Hongfei. "High-throughput Visual Knowledge Analysis and Retrieval in Big Data Ecosystems." Thesis, University of Missouri - Columbia, 2019. http://pqdtopen.proquest.com/#viewpdf?dispub=13877134.

Full text
Abstract:

Visual knowledge plays an important role in many highly skilled applications, such as medical diagnosis, geospatial image analysis and pathology diagnosis. Medical practitioners are able to interpret and reason about diagnostic images based on not only primitive-level image features such as color, texture, and spatial distribution but also their experience and tacit knowledge which are seldom articulated explicitly. This reasoning process is dynamic and closely related to real-time human cognition. Due to a lack of visual knowledge management and sharing tools, it is difficult to capture and transfer such tacit and hard-won expertise to novices. Moreover, many mission-critical applications require the ability to process such tacit visual knowledge in real time. Precisely how to index this visual knowledge computationally and systematically still poses a challenge to the computing community.

My dissertation research results in novel computational approaches for high-throughput visual knowledge analysis and retrieval from large-scale databases using the latest technologies in big data ecosystems. To provide a better understanding of visual reasoning, human gaze patterns are qualitatively measured spatially and temporally to model observers’ cognitive process. These gaze patterns are then indexed in a NoSQL distributed database as a visual knowledge repository, which is accessed using various unique retrieval methods developed through this dissertation work. To provide meaningful retrievals in real time, deep-learning methods for automatic annotation of visual activities and streaming similarity comparisons are developed under a gaze-streaming framework using Apache Spark.

This research has several potential applications that offer a broader impact among the scientific community and in the practical world. First, the proposed framework can be adapted for different domains, such as fine arts, life sciences, etc. with minimal effort to capture human reasoning processes. Second, with its real-time visual knowledge search function, this framework can be used for training novices in the interpretation of domain images, by helping them learn experts’ reasoning processes. Third, by helping researchers to understand human visual reasoning, it may shed light on human semantics modeling. Finally, integrating reasoning process with multimedia data, future retrieval of media could embed human perceptual reasoning for database search beyond traditional content-based media retrievals.
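
The streaming similarity step can be reduced to a toy comparison of gaze-feature vectors (random stand-ins; the dissertation's pipeline runs this kind of comparison at scale on Apache Spark):

    import numpy as np

    # Compare an incoming gaze-feature vector against stored expert
    # patterns by cosine similarity and report the closest match.
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    rng = np.random.default_rng(5)
    repository = {f"expert_{i}": rng.normal(size=64) for i in range(3)}
    incoming = rng.normal(size=64)

    best = max(repository, key=lambda k: cosine(incoming, repository[k]))
    print("closest expert pattern:", best)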

APA, Harvard, Vancouver, ISO, and other styles
43

Ballinger, Tracy J. "Analysis of genomic rearrangements in cancer from high throughput sequencing data." Thesis, University of California, Santa Cruz, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3729995.

Full text
Abstract:

In the last century cancer has become increasingly prevalent and is the second-largest killer in the United States, estimated to afflict 1 in 4 people during their lifetime. Despite our long history with cancer and our herculean efforts to thwart the disease, in many cases we still do not understand the underlying causes or have successful treatments. In my graduate work, I have developed two approaches to the study of cancer genomics and applied them to the whole genome sequencing data of cancer patients from The Cancer Genome Atlas (TCGA). In collaboration with Dr. Ewing, I built a pipeline to detect retrotransposon insertions from paired-end high-throughput sequencing data and found somatic retrotransposon insertions in a fifth of cancer patients.

My second novel contribution to the study of cancer genomics is the development of the CN-AVG pipeline, a method for reconstructing the evolutionary history of a single tumor by predicting the order of structural mutations such as deletions, duplications, and inversions. The CN-AVG theory was developed by Drs. Haussler, Zerbino, and Paten; it samples potential evolutionary histories for a tumor using Markov Chain Monte Carlo sampling. I contributed to the development of this method by testing its accuracy and limitations on simulated evolutionary histories. I found that the ability to reconstruct a history decays exponentially with increased breakpoint reuse, but that we can estimate how accurately we reconstruct a mutation event using the likelihood scores of the events. I further designed novel techniques for the application of CN-AVG to whole genome sequencing data from actual patients and applied these techniques to search for evolutionary patterns in glioblastoma multiforme using sequencing data from TCGA. My results show patterns of two-hit deletions, as we would expect, and amplifications occurring over several mutational events. I also find that the CN-AVG method frequently makes use of whole-chromosome copy number changes followed by localized deletions, a bias that could be mitigated by modifying the cost function for an evolutionary history.
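
The MCMC idea behind history sampling can be conveyed with a generic Metropolis sketch (the "history" is reduced to a single number here; CN-AVG's actual state space of mutation orders is far richer):

    import numpy as np

    # Propose a modified state, accept with probability min(1, p'/p);
    # the log-score below is an invented stand-in for a history's score.
    rng = np.random.default_rng(6)

    def log_score(h):
        return -0.5 * (h - 3.0) ** 2

    h = 0.0
    samples = []
    for _ in range(5000):
        proposal = h + rng.normal(scale=0.5)
        if np.log(rng.uniform()) < log_score(proposal) - log_score(h):
            h = proposal
        samples.append(h)
    print("posterior mean ~", np.mean(samples[1000:]))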

APA, Harvard, Vancouver, ISO, and other styles
44

Yu, Guoqiang. "Machine Learning to Interrogate High-throughput Genomic Data: Theory and Applications." Diss., Virginia Tech, 2011. http://hdl.handle.net/10919/28980.

Full text
Abstract:
The missing heritability in genome-wide association studies (GWAS) is an intriguing open scientific problem which has attracted great recent interest. Interaction effects among risk factors, both genetic and environmental, are hypothesized to be one of the main sources of missing heritability. Moreover, detection of multilocus interaction effects may also have great implications for revealing disease/biological mechanisms, for accurate risk prediction, personalized clinical management, and targeted drug design. However, current analysis of GWAS largely ignores interaction effects, partly due to the lack of tools that meet the statistical and computational challenges posed by taking interaction effects into account. Here, we propose a novel statistically-based framework (Significant Conditional Association) for systematically exploring, assessing the significance of, and detecting interaction effects. Further, our SCA work has also revealed new theoretical results and insights on interaction detection, as well as theoretical performance bounds. Using in silico data, we show that the new approach has detection power significantly better than that of peer methods, while controlling the running time within a permissible range. More importantly, we applied our methods to several real data sets, confirming well-validated interactions with more convincing evidence (generating smaller p-values and requiring fewer samples) than those obtained through conventional methods, eliminating inconsistent results in the original reports, and observing novel discoveries that are otherwise undetectable. The proposed methods provide a useful tool to mine new knowledge from existing GWAS and generate new hypotheses for further research. Microarray gene expression studies provide new opportunities for the molecular characterization of heterogeneous diseases. Multiclass gene selection is an imperative task for identifying phenotype-associated mechanistic genes and achieving accurate diagnostic classification. Most existing multiclass gene selection methods rely heavily on the direct extension of two-class gene selection methods. However, simple extensions of binary discriminant analysis to multiclass gene selection are suboptimal and not well-matched to the unique characteristics of the multi-category classification problem. We report a simpler and yet more accurate strategy than previous works for multicategory classification of heterogeneous diseases. Our method selects the union of one-versus-everyone phenotypic up-regulated genes (OVEPUGs) and matches this gene selection with a one-versus-rest support vector machine. Our approach provides even-handed gene resources for discriminating both neighboring and well-separated classes, and aims to ensure the statistical reproducibility and biological plausibility of the selected genes. We evaluated the fold changes of OVEPUGs and found that only a small number of high-ranked genes were required to achieve superior accuracy for multicategory classification. We tested the proposed OVEPUG method on six real microarray gene expression data sets (five public benchmarks and one in-house data set) and two simulation data sets, observing significantly improved performance with lower error rates, fewer marker genes, and higher performance sustainability, as compared to several widely-adopted gene selection and classification methods.
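A rough sketch pairing a per-class up-regulation filter with a one-versus-rest linear SVM, loosely echoing the OVEPUG idea (random data; the gene counts and thresholds are arbitrary, not the method's actual selection rule):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.multiclass import OneVsRestClassifier

    # Random stand-ins for expression (samples x genes) and class labels.
    rng = np.random.default_rng(7)
    X, y = rng.normal(size=(90, 500)), rng.integers(0, 3, size=90)

    # Keep genes most up-regulated in each class versus all other samples.
    keep = set()
    for c in np.unique(y):
        diff = X[y == c].mean(axis=0) - X[y != c].mean(axis=0)
        keep.update(np.argsort(diff)[-20:])
    cols = sorted(keep)

    clf = OneVsRestClassifier(LinearSVC()).fit(X[:, cols], y)
    print("training accuracy:", clf.score(X[:, cols], y))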
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
45

Zucker, Mark Raymond. "Inferring Clonal Heterogeneity in Chronic Lymphocytic Leukemia From High-Throughput Data." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1554049121307262.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Guennel, Tobias. "Statistical Methods for Normalization and Analysis of High-Throughput Genomic Data." VCU Scholars Compass, 2012. http://scholarscompass.vcu.edu/etd/2647.

Full text
Abstract:
High-throughput genomic datasets obtained from microarray or sequencing studies have revolutionized the field of molecular biology over the last decade. The complexity of these new technologies also poses new challenges for statisticians in separating biologically relevant information from technical noise. Two methods are introduced that address important issues with the normalization of array comparative genomic hybridization (aCGH) microarrays and the analysis of RNA sequencing (RNA-Seq) studies. Many cancer and genetic studies investigating copy number aberrations at the DNA level use comparative genomic hybridization (CGH) on oligo arrays. However, aCGH data often suffer from low signal-to-noise ratios, resulting in poor resolution of fine features. Bilke et al. showed that the commonly used running average noise reduction strategy performs poorly when errors are dominated by systematic components. A method called pcaCGH is proposed that significantly reduces noise using a non-parametric regression on technical covariates of probes to estimate systematic bias. Then a robust principal components analysis (PCA) estimates any remaining systematic bias not explained by the technical covariates used in the preceding regression. The proposed algorithm is demonstrated on two CGH datasets measuring the NCI-60 cell lines utilizing NimbleGen and Agilent microarrays. The method achieves a nominal error variance reduction of 60%-65% as well as a 2-fold increase in signal-to-noise ratio on average, resulting in more detailed copy number estimates. Furthermore, correlations of signal intensity ratios of NimbleGen and Agilent arrays are increased by 40% on average, indicating a significant improvement in agreement between the technologies. A second algorithm called gamSeq is introduced to test for differential gene expression in RNA sequencing studies. Limitations of existing methods are outlined and the proposed algorithm is compared to these existing algorithms. Simulation studies and real data are used to show that gamSeq improves upon existing methods with regard to type I error control while maintaining similar or better power for a range of sample sizes for RNA-Seq studies. Furthermore, the proposed method is applied to detect differential 3' UTR usage.
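A loose two-stage sketch in the spirit of pcaCGH, with a linear fit standing in for the nonparametric regression and ordinary PCA for the robust PCA (all data invented):

    import numpy as np
    from sklearn.decomposition import PCA

    # Stage 1: regress probe signal on a technical covariate (here, GC
    # content) per array and keep residuals; Stage 2: strip the leading
    # principal component of the residuals as remaining systematic bias.
    rng = np.random.default_rng(8)
    signal = rng.normal(size=(100, 30))        # probes x arrays
    gc = rng.uniform(0.3, 0.7, size=100)       # per-probe GC content

    resid = np.empty_like(signal)
    for j in range(signal.shape[1]):
        slope, intercept = np.polyfit(gc, signal[:, j], deg=1)
        resid[:, j] = signal[:, j] - (slope * gc + intercept)

    pca = PCA(n_components=1).fit(resid.T)
    bias = pca.inverse_transform(pca.transform(resid.T)).T
    cleaned = resid - bias
    print(cleaned.shape)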
APA, Harvard, Vancouver, ISO, and other styles
47

Ferber, Kyle L. "Methods for Predicting an Ordinal Response with High-Throughput Genomic Data." VCU Scholars Compass, 2016. http://scholarscompass.vcu.edu/etd/4585.

Full text
Abstract:
Multigenic diagnostic and prognostic tools can be derived for ordinal clinical outcomes using data from high-throughput genomic experiments. A challenge in this setting is that the number of predictors is much greater than the sample size, so traditional ordinal response modeling techniques must be exchanged for more specialized approaches. Existing methods perform well on some datasets, but there is room for improvement in terms of variable selection and predictive accuracy. Therefore, we extended an impressive binary response modeling technique, Feature Augmentation via Nonparametrics and Selection, to the ordinal response setting. Through simulation studies and analyses of high-throughput genomic datasets, we showed that our Ordinal FANS method is sensitive and specific when discriminating between important and unimportant features from the high-dimensional feature space and is highly competitive in terms of predictive accuracy. Discrete survival time is another example of an ordinal response. For many illnesses and chronic conditions, it is impossible to record the precise date and time of disease onset or relapse. Further, the HIPAA Privacy Rule prevents recording of protected health information, which includes all elements of dates (except year), so in the absence of a “limited dataset,” date of diagnosis or date of death is not available for calculating overall survival. Thus, we developed a method that is suitable for modeling high-dimensional discrete survival time data and assessed its performance by conducting a simulation study and by predicting the discrete survival times of acute myeloid leukemia patients using a high-dimensional dataset.
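A baseline ordinal-response fit illustrates the low-dimensional setting that Ordinal FANS extends (a proportional-odds model on invented data; this is not the Ordinal FANS algorithm):

    import numpy as np
    import pandas as pd
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    # Two made-up "gene" predictors and an ordered three-level outcome.
    rng = np.random.default_rng(9)
    n = 300
    X = pd.DataFrame(rng.normal(size=(n, 2)), columns=["gene1", "gene2"])
    latent = X["gene1"] - 0.5 * X["gene2"] + rng.logistic(size=n)
    y = pd.cut(latent, bins=[-np.inf, -1, 1, np.inf],
               labels=["low", "mid", "high"], ordered=True)

    fit = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
    print(fit.params)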
APA, Harvard, Vancouver, ISO, and other styles
48

Glaus, Peter. "Bayesian methods for gene expression analysis from high-throughput sequencing data." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/bayesian-methods-for-gene-expression-analysis-from-highthroughput-sequencing-data(cf9680e0-a3f2-4090-8535-a39f3ef50cc4).html.

Full text
Abstract:
We study the tasks of transcript expression quantification and differential expression analysis based on data from high-throughput sequencing of the transcriptome (RNA-seq). In an RNA-seq experiment, subsequences of nucleotides are sampled from a transcriptome specimen, producing millions of short reads. The reads can be mapped to a reference to determine the set of transcripts from which they were sequenced. We can measure the expression of transcripts in the specimen by determining the number of reads that were sequenced from individual transcripts. In this thesis we propose a new probabilistic method for inferring the expression of transcripts from RNA-seq data. We use a generative model of the data that can account for read errors, fragment length distribution and non-uniform distribution of reads along transcripts. We apply the Bayesian inference approach, using the Gibbs sampling algorithm to sample from the posterior distribution of transcript expression. Producing the full distribution enables assessment of the uncertainty of the estimated expression levels. We also investigate the use of alternative inference techniques for the transcript expression quantification. We apply a collapsed Variational Bayes algorithm which can provide accurate estimates of mean expression faster than the Gibbs sampling algorithm. Building on the results from transcript expression quantification, we present a new method for the differential expression analysis. Our approach utilizes the full posterior distribution of expression from multiple replicates in order to detect significant changes in abundance between different conditions. The method can be applied to differential expression analysis of both genes and transcripts. We use the newly proposed methods to analyse real RNA-seq data and provide evaluation of their accuracy using synthetic datasets. We demonstrate the advantages of our approach in comparisons with existing alternative approaches for expression quantification and differential expression analysis. The methods are implemented in the BitSeq package, which is freely distributed under an open-source license. Our methods can be accessed and used by other researchers for RNA-seq data analysis.
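A toy version of the Gibbs sampling idea can be sketched as a read-assignment mixture model: each ambiguous read is repeatedly reassigned among its compatible transcripts, and the retained samples give a full posterior over expression, including its uncertainty, as the abstract emphasizes. The sketch below is illustrative only; BitSeq's actual generative model additionally accounts for read errors, fragment length distribution, and non-uniform read placement along transcripts.

```python
# Toy collapsed Gibbs sampler for transcript expression, in the spirit of
# BitSeq's generative approach (all numbers and the compatibility lists
# are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(2)
K, alpha = 3, 1.0                                  # transcripts, Dirichlet prior

# Each read lists the transcripts it maps to (multi-mappers are ambiguous).
compat = [[0], [0, 1], [1], [1, 2], [2], [0, 2], [1], [2], [0, 1], [1, 2]] * 30
R = len(compat)

z = np.array([rng.choice(c) for c in compat])      # initial read assignments
counts = np.bincount(z, minlength=K)
samples = []

for sweep in range(2000):
    for r in range(R):
        counts[z[r]] -= 1                          # remove read r from its transcript
        cand = np.array(compat[r])
        w = counts[cand] + alpha                   # collapsed conditional: n_t + alpha
        z[r] = rng.choice(cand, p=w / w.sum())
        counts[z[r]] += 1
    if sweep >= 500:                               # keep post-burn-in samples
        samples.append((counts + alpha) / (R + K * alpha))

theta = np.mean(samples, axis=0)                   # posterior mean expression
sd = np.std(samples, axis=0)                       # uncertainty from the full posterior
print("expression:", theta.round(3), "posterior sd:", sd.round(3))
```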
APA, Harvard, Vancouver, ISO, and other styles
49

Paicu, Claudia. "miRNA detection and analysis from high-throughput small RNA sequencing data." Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/63738/.

Full text
Abstract:
Small RNAs (sRNAs) are a broad class of short regulatory non-coding RNAs. microRNAs (miRNAs) are a special class of ~21-22 nucleotide sRNAs which are derived from a stable hairpin-like secondary structure. miRNAs have critical gene regulatory functions and are involved in many pathways including developmental timing, organogenesis and development in both plants and animals. Next generation sequencing (NGS) technologies, which are often used for identifying miRNAs, are continuously evolving, generating datasets containing millions of sRNAs, which has led to new challenges for the tools used to predict miRNAs from such data. Several tools exist for miRNA detection from NGS datasets; we review them in this thesis and identify a number of potential shortcomings in their algorithms. We then present a novel miRNA prediction algorithm, miRCat2. Our algorithm is more robust to variations in sequencing depth because it compares aligned sRNA reads to a random uniform distribution to detect peaks in the input dataset, using a new entropy-based approach. It then applies filters based on miRNA biogenesis to the read alignment and to the computed secondary structure. Results show that miRCat2 has a better specificity-sensitivity trade-off than similar tools, and its predictions also contain a larger percentage of sequences that are downregulated in mutants of the miRNA biogenesis pathway. This confirms the validity of the novel predictions, which may lead to new miRNA annotations, expanding and contributing to the field of sRNA research.
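The entropy-based comparison against a uniform read distribution can be illustrated in a few lines. This is a simplification, not miRCat2's actual algorithm (which adds biogenesis and secondary-structure filters on top): a locus whose reads pile up at one or two positions has low normalized entropy and is a plausible miRNA candidate, while evenly spread, degradation-like coverage is rejected.

```python
# Minimal entropy-based "peak vs uniform" check, illustrating the kind of
# test the miRCat2 abstract describes; the loci and the 0.5 cutoff are
# made-up examples.
import numpy as np

def normalized_entropy(read_counts):
    """Shannon entropy of the per-position read distribution, scaled to [0, 1];
    values near 1 indicate uniform coverage, values near 0 a sharp peak."""
    counts = np.asarray(read_counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum() / np.log2(len(counts)))

peak_locus = [0, 0, 0, 480, 510, 2, 0, 0, 1, 0]          # reads pile up at one site
noisy_locus = [45, 52, 61, 48, 55, 40, 58, 47, 50, 44]   # degradation-like coverage

for name, locus in [("peak", peak_locus), ("noisy", noisy_locus)]:
    h = normalized_entropy(locus)
    print(f"{name}: entropy={h:.2f} -> {'candidate' if h < 0.5 else 'reject'}")
```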
APA, Harvard, Vancouver, ISO, and other styles
50

Malo, Nathalie. "Statistical contributions to data analysis for high-throughput screening of chemical compounds." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=102680.

Full text
Abstract:
High-throughput screening (HTS) is a relatively new process that allows several thousand chemical compounds to be tested rapidly in order to identify their potential as drug candidates. Despite increasing numbers of promising candidates, however, the number of new compounds that ultimately reach the market has declined. One way to improve this situation is to develop efficient and accurate data processing and statistical testing methods tailored for HTS. Human, biological, or mechanical errors may arise over the several days it takes to run an entire screen, causing unwanted variation or "noise". Consequently, HTS data need to be preprocessed in order to reduce the effect of systematic errors. Robust statistical methods for outlier detection can then be applied to identify the most promising compounds. Current practice typically uses only single measurements, which precludes the use of standard statistical methods and forces scientists to rely on strong untested assumptions and on arbitrary choices of significance thresholds.
The broad objectives of this research are to develop and evaluate robust and reliable statistical methods for both data preprocessing and statistical inference. This thesis is divided into three papers. The first manuscript is a critical review of the current practices in HTS data analysis. It includes several recommendations for improving sensitivity and specificity of screens. The second manuscript compares the performance of different robust preprocessing methods applied to replicated two-way data with respect to detection of outlying cells. The third manuscript evaluates some of the statistical methods described in the first manuscript with respect to their performance when applied to several empirical data sets.
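One concrete example of the robust outlier-detection idea this thesis evaluates is plate-wise hit calling with a median/MAD z-score, which resists distortion by the very outliers (hits) being sought. The simulated plate and the |z| > 3 cutoff below are illustrative assumptions, not results from the thesis.

```python
# Plate-wise robust z-score hit calling of the kind reviewed in the thesis:
# median/MAD normalization is not pulled off-center by the hits themselves.
import numpy as np

rng = np.random.default_rng(3)
plate = rng.normal(100, 10, size=(8, 12))          # simulated 96-well plate readouts
plate[2, 5] = 35.0                                 # a strong inhibitor ("hit")
plate[6, 9] = 170.0                                # a strong activator

med = np.median(plate)
mad = 1.4826 * np.median(np.abs(plate - med))      # MAD scaled to match the sd under normality
z = (plate - med) / mad

hits = np.argwhere(np.abs(z) > 3)                  # a common |z| > 3 cutoff
for r, c in hits:
    print(f"well ({r}, {c}): raw={plate[r, c]:.1f}, robust z={z[r, c]:.2f}")
```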
APA, Harvard, Vancouver, ISO, and other styles