Dissertations / Theses on the topic 'Omics data analysis'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Omics data analysis.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Maspero, Davide. "Computational strategies to dissect the heterogeneity of multicellular systems via multiscale modelling and omics data analysis." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2022. http://hdl.handle.net/10281/368331.

Full text
Abstract:
Heterogeneity pervades biological systems and manifests itself in the structural and functional differences observed both among different individuals of the same group (e.g., organisms or disease systems) and among the constituent elements of a single individual (e.g., cells). The study of the heterogeneity of biological systems and, in particular, of multicellular systems is fundamental for the mechanistic understanding of complex physiological and pathological phenomena (e.g., cancer), as well as for the definition of effective prognostic, diagnostic, and therapeutic strategies. This work focuses on developing and applying computational methods and mathematical models for characterising the heterogeneity of multicellular systems and, especially, cancer cell subpopulations underlying the evolution of neoplastic pathology. Similar methodologies have been developed to characterise viral evolution and heterogeneity effectively. The research is divided into two complementary portions, the first aimed at defining methods for the analysis and integration of omics data generated by sequencing experiments, the second at modelling and multiscale simulation of multicellular systems. Regarding the first strand, next-generation sequencing technologies allow us to generate vast amounts of omics data, for example, related to the genome or transcriptome of a given individual, through bulk or single-cell sequencing experiments. One of the main challenges in computer science is to define computational methods to extract useful information from such data, taking into account the high levels of data-specific errors, mainly due to technological limitations. In particular, in the context of this work, we focused on developing methods for the analysis of gene expression and genomic mutation data. In detail, an exhaustive comparison of machine-learning methods for denoising and imputation of single-cell RNA-sequencing data has been performed. Moreover, methods for mapping expression profiles onto metabolic networks have been developed through an innovative framework that has allowed one to stratify cancer patients according to their metabolism. A subsequent extension of the method allowed us to analyse the distribution of metabolic fluxes within a population of cells via a flux balance analysis approach. Regarding the analysis of mutational profiles, the first method for reconstructing phylogenomic models from longitudinal data at single-cell resolution has been designed and implemented, exploiting a framework that combines a Markov Chain Monte Carlo with a novel weighted likelihood function. Similarly, a framework that exploits low-frequency mutation profiles to reconstruct robust phylogenies and likely chains of infection has been developed by analysing sequencing data from viral samples. The same mutational profiles also allow us to deconvolve the signal in the signatures associated with specific molecular mechanisms that generate such mutations through an approach based on non-negative matrix factorisation. The research conducted with regard to the computational simulation has led to the development of a multiscale model, in which the simulation of cell population dynamics, represented through a Cellular Potts Model, is coupled to the optimisation of a metabolic model associated with each synthetic cell. Using this model, it is possible to represent assumptions in mathematical terms and observe properties emerging from these assumptions. 
Finally, we present a first attempt to combine the two methodological approaches which led to the integration of single-cell RNA-seq data within the multiscale model, allowing data-driven hypotheses to be formulated on the emerging properties of the system.
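As an illustrative aside (not taken from the thesis itself): the abstract above mentions deconvolving mutational profiles into signatures with non-negative matrix factorisation. A minimal sketch of that kind of factorisation, assuming a made-up sample-by-96-category mutation-count matrix and four signatures, might look like this:

```python
# Illustrative sketch only: deconvolving a mutation-count matrix into signatures
# with non-negative matrix factorisation. The matrix shape, the Poisson-simulated
# counts and the choice of four signatures are assumptions, not the thesis's data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Rows: samples; columns: the 96 trinucleotide mutation categories commonly used.
mutation_counts = rng.poisson(lam=5.0, size=(50, 96)).astype(float)

model = NMF(n_components=4, init="nndsvda", max_iter=1000, random_state=0)
exposures = model.fit_transform(mutation_counts)   # per-sample signature activities
signatures = model.components_                     # per-signature mutation spectra

print(exposures.shape, signatures.shape)           # (50, 4) (4, 96)
```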
APA, Harvard, Vancouver, ISO, and other styles
2

Wang, Zhi. "Module-Based Analysis for "Omics" Data." Thesis, North Carolina State University, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3690212.

Full text
Abstract:

This thesis focuses on methodologies and applications of module-based analysis (MBA) in omics studies to investigate the relationships between phenotypes and biomarkers, e.g., SNPs, genes, and metabolites. As an alternative to traditional single-biomarker approaches, MBA may increase the detectability and reproducibility of results because biomarkers tend to have moderate individual effects but a significant aggregate effect; it may improve the interpretability of findings and facilitate the construction of follow-up biological hypotheses because MBA assesses biomarker effects in a functional context, e.g., pathways and biological processes. Finally, for exploratory "omics" studies, which usually begin with a full scan of a long list of candidate biomarkers, MBA provides a natural way to reduce the total number of tests, and hence relax the multiple-testing burden and improve power.

The first MBA project focuses on genetic association analysis that assesses the main and interaction effects for sets of genetic (G) and environmental (E) factors rather than for individual factors. We develop a kernel machine regression approach to evaluate the complete effect profile (i.e., the G, E, and G-by-E interaction effects separately or in combination) and construct a kernel function for the gene-environment (GE) interaction directly from the genetic kernel and the environmental kernel. We use simulation studies and real data applications to show the improved performance of the kernel machine (KM) regression method over the commonly adopted PC regression methods across a wide range of scenarios. The largest gain in power occurs when the underlying effect structure involves complex GE interactions, suggesting that the proposed method could be a useful and powerful tool for performing exploratory or confirmatory analyses in GxE-GWAS.
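As an aside, the abstract states that the GE interaction kernel is built directly from the genetic and environmental kernels. A minimal sketch of one common construction, the element-wise (Hadamard) product of the two kernels, is shown below; the simulated data, the linear kernel choice and the dimensions are assumptions for illustration and may differ from the thesis's formulation:

```python
# Minimal sketch, assuming the GE interaction kernel is formed as the element-wise
# (Hadamard) product of the genetic and environmental kernels -- one common
# construction; the thesis's exact formulation may differ.
import numpy as np

def linear_kernel(X):
    return X @ X.T

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(100, 500)).astype(float)   # SNP genotypes (0/1/2), hypothetical
E = rng.normal(size=(100, 5))                            # environmental covariates, hypothetical

K_G = linear_kernel(G)          # genetic kernel
K_E = linear_kernel(E)          # environmental kernel
K_GE = K_G * K_E                # interaction kernel as the Hadamard product

print(K_GE.shape)               # (100, 100)
```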

In the second MBA project, we extend the kernel machine framework developed in the first project to model biomarkers with network structure. A network summarizes the functional interplay among biological units; incorporating network information can more precisely model the biological effects, enhance the ability to detect true signals, and facilitate our understanding of the underlying biological mechanisms. In this work, we develop two kernel functions to capture different types of network structure information. Through simulations and a metabolomics study, we show that the proposed network-based methods can have markedly improved power over approaches that ignore network information.

Metabolites are the end products of cellular processes and reflect the ultimate responses of a biological system to genetic variations or environmental exposures. Because of the unique properties of metabolites, pharmacometabolomics aims to understand the underlying signatures that contribute to individual variations in drug response and to identify biomarkers that can be helpful for response prediction. To facilitate mining pharmacometabolomic data, we establish an MBA pipeline that has great practical value in the detection and interpretation of signatures, which may potentially indicate a functional basis for the drug response. We illustrate the utility of the pipeline by investigating two scientific questions in an aspirin study: (1) which metabolite changes can be attributed to aspirin intake, and (2) what are the metabolic signatures that can be helpful in predicting aspirin resistance. Results show that the MBA pipeline enables us to identify metabolic signatures that are not found in preliminary single-metabolite analysis.

APA, Harvard, Vancouver, ISO, and other styles
3

Zheng, Ning. "Mediation modeling and analysis for high-throughput omics data." Thesis, Uppsala universitet, Statistiska institutionen, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-256318.

Full text
Abstract:
There is a strong need for powerful unified statistical methods for discovering the underlying genetic architecture of complex traits with the assistance of omics information. In this paper, two methods aiming to detect novel associations between the human genome and complex traits using intermediate omics data are developed based on statistical mediation modeling. We demonstrate theoretically that, given proper mediators, the proposed statistical mediation models have better power than genome-wide association studies (GWAS) to detect associations missed in standard GWAS that ignore the mediators. For each of the modeling methods in this paper, an empirical example is given, where the association between a SNP and BMI missed by standard GWAS can be discovered by mediation analysis.
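As an illustrative aside, a minimal mediation sketch in the spirit described (SNP → omics mediator → BMI), using simulated data and a simple Sobel test rather than the thesis's actual models:

```python
# A minimal mediation sketch: SNP -> omics mediator -> BMI. The simulated data,
# effect sizes and the Sobel test are illustrative assumptions, not the thesis's models.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 500
snp = rng.integers(0, 3, size=n).astype(float)          # genotype coded 0/1/2
mediator = 0.4 * snp + rng.normal(size=n)               # e.g., expression or methylation
bmi = 0.5 * mediator + rng.normal(size=n)               # outcome affected only via mediator

# Path a: SNP -> mediator
fit_a = sm.OLS(mediator, sm.add_constant(snp)).fit()
a, se_a = fit_a.params[1], fit_a.bse[1]

# Path b: mediator -> outcome, adjusting for the SNP
X = sm.add_constant(np.column_stack([snp, mediator]))
fit_b = sm.OLS(bmi, X).fit()
b, se_b = fit_b.params[2], fit_b.bse[2]

# Sobel test for the indirect (mediated) effect a*b
z = (a * b) / np.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
print("indirect effect:", a * b, "p =", 2 * stats.norm.sf(abs(z)))
```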
APA, Harvard, Vancouver, ISO, and other styles
4

Campanella, Gianluca. "Statistical analysis of '-omics' data : developments and applications." Thesis, Imperial College London, 2015. http://hdl.handle.net/10044/1/32109.

Full text
Abstract:
In recent years, increasingly efficient molecular biology techniques created new opportunities to harness large-scale repositories of biological material collected in epidemiological studies; however, methods to manipulate and analyse the wealth of information thus generated have lagged behind. The introductory chapter of this thesis presents the multifaceted field of 'computational epidemiology' from the perspectives of molecular biology, measurement theory, and statistical modelling. Focusing on measurement of DNA methylation levels, the author also reviews the state of the art, proposes novel pre-processing methods and evaluation frameworks, and provides recommendations for genome-wide studies of DNA methylation levels using Illumina Infinium® HumanMethylation450 BeadChips. The remaining chapters, in the form of three self-contained scientific articles, cover applications on the following topics: (i) DNA methylation differences associated with internal migration patterns within Italy; (ii) associations of DNA methylation profiles with adiposity measures, targeted gene expression, biomarkers of lipid and glucose metabolism, and risk of developing three obesity-associated diseases; (iii) associations of a dietary score with blood pressure, and with urinary metabolites as characterised by NMR spectroscopy. The thesis is concluded with general remarks and the presentation of some open problems that offer potential for future research.
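As a purely illustrative aside on this type of data: a standard quantity in Infinium methylation analysis (not one of the thesis's proposed methods) is the beta-value computed from methylated and unmethylated probe intensities, sketched below with hypothetical numbers:

```python
# One common pre-processing quantity for Infinium methylation arrays (not the
# thesis's novel methods): beta-values from methylated/unmethylated intensities,
# with the usual offset of 100 to stabilise low-intensity probes.
import numpy as np

def beta_values(meth, unmeth, offset=100.0):
    """Beta = M / (M + U + offset), bounded in [0, 1)."""
    meth = np.asarray(meth, dtype=float)
    unmeth = np.asarray(unmeth, dtype=float)
    return meth / (meth + unmeth + offset)

# Hypothetical intensities for three probes in two samples
M = np.array([[1500.0, 300.0, 80.0], [1200.0, 250.0, 60.0]])
U = np.array([[200.0, 2800.0, 90.0], [180.0, 2600.0, 100.0]])
print(beta_values(M, U))
```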
APA, Harvard, Vancouver, ISO, and other styles
5

Budimir, Iva <1992>. "Stochastic Modeling and Correlation Analysis of Omics Data." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amsdottorato.unibo.it/9792/1/Budimir_Iva_tesi.pdf.

Full text
Abstract:
We studied the properties of three different types of omics data: protein domains in bacteria, gene length in metazoan genomes, and methylation in humans. Gene elongation and protein domain diversification are among the most important mechanisms in the evolution of functional complexity. For this reason, the investigation of the dynamic processes that led to their current configuration can highlight important aspects of genome and proteome evolution and consequently of the evolution of living organisms. The potential of methylation to regulate the expression of genes is usually attributed to groups of close CpG sites. We performed a correlation analysis to investigate the collaborative structure of all CpGs on chromosome 21. The long-tailed distributions of gene length and protein domain occurrences were successfully described by a stochastic evolutionary model and fitted with the Poisson log-normal distribution. This approach included both demographic and environmental stochasticity and Gompertzian density regulation. The parameters of the fitted distributions were compared at the evolutionary scale. This allowed us to define a novel protein-domain-based phylogenetic method for bacteria which performed well at the intraspecies level. In the context of gene length distribution, we derived a new generalized population dynamics model for diverse subcommunities which allowed us to jointly model both coding and non-coding genomic sequences. A possible application of this approach is a method for differentiating between protein-coding genes and pseudogenes based on their length. General properties of the methylation correlation structure were first analyzed for a large data set of healthy controls and later compared to a Down syndrome (DS) data set. The CpGs demonstrated strong group behaviour even across large genomic distances. The detected differences in DS were surprisingly small, possibly because the small DS sample size reduced the power of the statistical analysis.
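As an aside, the Poisson log-normal distribution mentioned in the abstract mixes a Poisson count with a log-normal rate. A small sketch with assumed parameters, comparing the model's probabilities with simulated frequencies:

```python
# Sketch of the Poisson log-normal (PLN) distribution used to model over-dispersed
# counts such as protein-domain occurrences: the probability of a count k is the
# Poisson likelihood averaged over a log-normal rate. Parameters are illustrative
# assumptions, not values estimated in the thesis.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def pln_pmf(k, mu, sigma):
    """P(K = k) when K | lam ~ Poisson(lam) and lam ~ LogNormal(mu, sigma)."""
    integrand = lambda lam: stats.poisson.pmf(k, lam) * stats.lognorm.pdf(lam, s=sigma, scale=np.exp(mu))
    value, _ = quad(integrand, 0, np.inf)
    return value

mu, sigma = 1.0, 0.8                       # hypothetical parameters
rng = np.random.default_rng(3)
sample = rng.poisson(rng.lognormal(mean=mu, sigma=sigma, size=5000))

# Compare the model's long-tailed pmf with the empirical frequencies
for k in (0, 1, 2, 5, 10, 20):
    print(k, round(pln_pmf(k, mu, sigma), 4), round(float(np.mean(sample == k)), 4))
```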
APA, Harvard, Vancouver, ISO, and other styles
6

Kim, Jieun. "Computational tools for the integrative analysis of multi-omics data to decipher trans-omics networks." Thesis, The University of Sydney, 2022. https://hdl.handle.net/2123/28524.

Full text
Abstract:
Regulatory networks define the phenotype, morphology, and function of cells. These networks are built from the basic building blocks of the cell (DNA, RNA, and proteins) and cut across the respective omics layers (genome, transcriptome, and proteome). The resulting omics networks encompass a near-infinite set of possible nodes and edges that intricately connect the 'omes'. With the rapid advancement of the technologies that generate omics data in bulk samples and now at single-cell resolution, the field of life sciences is faced with the challenge of connecting these omes to generate trans-omics networks. To this end, this thesis addressed some of the pressing challenges in trans-omics network reconstruction and the integrative analysis of omics data at both bulk and single-cell resolution: 1) the lack of an integrated pipeline for processing and downstream analysis of lesser-studied omics layers; 2) the need for an integrative framework to reconstruct transcriptional networks and discover novel regulators of transcriptional regulation; and 3) the development of tools for the reconstruction of single-cell multi-modal TRNs. I envision the work of my thesis contributing towards the integrative study of bulk and single-cell trans-omics analysis, which I believe will become essential and standard practice in molecular biological studies as the comprehensiveness and accuracy of omics data measurements and of the databases for connecting different omics improve.
APA, Harvard, Vancouver, ISO, and other styles
7

Ding, Hao. "Visualization and Integrative analysis of cancer multi-omics data." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1467843712.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Castleberry, Alissa. "Integrated Analysis of Multi-Omics Data Using Sparse Canonical Correlation Analysis." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu15544898045976.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Tellaroli, Paola. "Three topics in omics research." Doctoral thesis, Università degli studi di Padova, 2015. http://hdl.handle.net/11577/3423912.

Full text
Abstract:
The rather generic title of this thesis is due to the fact that several aspects of biological phenomena have been investigated. Most of this work addressed the limitations of one of the essential tools for analyzing gene expression data: cluster analysis. With several hundred clustering methods in existence, there is clearly no shortage of clustering algorithms but, at the same time, satisfactory answers to some basic questions are still to come. In particular, we present a novel algorithm for the clustering of static data and a new strategy for the clustering of short time-course data. Finally, we analyzed data coming from Cap Analysis Gene Expression, a relatively new technology useful for genome-wide promoter analysis and still mostly unexplored.
APA, Harvard, Vancouver, ISO, and other styles
10

Ayati, Marzieh. "Algorithms to Integrate Omics Data for Personalized Medicine." Case Western Reserve University School of Graduate Studies / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=case1527679638507616.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Konrad, Attila. "Investigation of Pathway Analysis Tools for mapping omics data to pathways." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20843.

Full text
Abstract:
This thesis examines pathway analysis tools (PATs) from a multidisciplinary perspective. Many PATs exist today, each analyzing specific types of omics data, so we investigate which tools are available and what they can do. By defining specific requirements, such as how many omics data types a tool can handle and the accuracy of its analysis, the most suitable PAT for mapping omics data to pathways can be identified. Results show that no PAT found today fulfills the specified set of requirements or the main goal through software testing. The Ingenuity PAT comes closest to fulfilling the requirements. At the request of the end user, two PATs were tested in combination to see whether they could fulfill the end user's requirements. The Uniprot batch converter was tested with FEvER, but the results were not successful, since the combination of the two PATs is no better than the Ingenuity PAT. Focus then turned to an alternative, the NCBI website, whose search engine is connected to several freely available PATs, thus fulfilling the requirements. Through the search engine, omics data can be combined and more than one input can be handled at a time. Since technology is rapidly moving forward, the need for new tools for data interpretation also grows, and in the near future we may find a PAT that fulfills the requirements of the end users.
APA, Harvard, Vancouver, ISO, and other styles
12

Lu, Yingzhou. "Multi-omics Data Integration for Identifying Disease Specific Biological Pathways." Thesis, Virginia Tech, 2018. http://hdl.handle.net/10919/83467.

Full text
Abstract:
Pathway analysis is an important task for gaining novel insights into the molecular architecture of many complex diseases. With the advancement of new sequencing technologies, large amounts of quantitative gene expression data have been continuously acquired. Emerging omics data sets such as proteomics have facilitated the investigation of disease-relevant pathways. Although much work has previously been done on single omics data, little has been reported on multi-omics data integration, mainly due to methodological and technological limitations. While a single omics data type can provide useful information about the underlying biological processes, multi-omics data integration gives a much more comprehensive picture of the cause-effect processes responsible for diseases and their subtypes. This project investigates the combination of miRNA-seq, proteomics, and RNA-seq data on seven types of muscular dystrophies and a control group. These unique multi-omics data sets provide us with the opportunity to identify disease-specific and most relevant biological pathways. We first perform the t-test and the OVEPUG test separately to define the differentially expressed genes in the protein and mRNA data sets. In multi-omics data sets, miRNAs also play a significant role in muscle development by regulating their target genes at the mRNA level. To exploit the relationship between miRNA and gene expression, we consult the commonly used target library TargetScan to collect all paired miRNA-mRNA and miRNA-protein co-expression pairs. Next, by conducting statistical analyses such as Pearson's correlation coefficient or the t-test, we measure the biologically expected correlation of each gene with its upstream miRNAs and identify those showing negative correlation among the aforementioned miRNA-mRNA and miRNA-protein pairs. Furthermore, we identify and assess the most relevant disease-specific pathways by inputting the differentially expressed genes and negatively correlated genes into gene-set libraries, and further characterize these prioritized marker subsets using IPA (Ingenuity Pathway Analysis) or KEGG. We then use Fisher's method to combine the p-values derived from separate gene sets into a joint significance test assessing common pathway relevance. In conclusion, we find all negatively correlated miRNA-mRNA and miRNA-protein pairs and identify several pathophysiological pathways related to muscular dystrophies by gene-set enrichment analysis. This novel multi-omics data integration study and subsequent pathway identification will shed new light on pathophysiological processes in muscular dystrophies and improve our understanding of the molecular pathophysiology of muscle disorders, supporting disease prevention and treatment in the long term.
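As an illustrative aside on two of the steps mentioned (testing for the biologically expected negative miRNA-target correlation, and combining p-values with Fisher's method), here is a minimal sketch on simulated data; the pairings and p-values are assumptions, not results from the thesis:

```python
# Illustrative sketch: (1) Pearson correlation for a miRNA-target pair, keeping the
# one-sided p-value for the expected negative direction; (2) Fisher's method to
# combine p-values from separate gene-set tests. All numbers are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_samples = 40
mirna = rng.normal(size=n_samples)
target_mrna = -0.6 * mirna + rng.normal(scale=0.8, size=n_samples)  # repressed target

r, p_two_sided = stats.pearsonr(mirna, target_mrna)
# One-sided p-value for the biologically expected negative correlation
p_negative = p_two_sided / 2 if r < 0 else 1 - p_two_sided / 2
print(f"r = {r:.2f}, one-sided p = {p_negative:.3g}")

# Fisher's method to combine p-values from separate gene-set tests
pvals_from_gene_sets = [0.03, 0.20, 0.008]          # hypothetical
chi2_stat, p_combined = stats.combine_pvalues(pvals_from_gene_sets, method="fisher")
print(f"combined p = {p_combined:.3g}")
```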
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
13

Lin, Yingxin. "Statistical modelling and machine learning for single cell data harmonisation and analysis." Thesis, The University of Sydney, 2022. https://hdl.handle.net/2123/28034.

Full text
Abstract:
Technological advances such as large-scale single-cell profiling have exploded in recent years and enabled unprecedented understanding of the behaviour of individual cells. Effectively harmonising multiple collections and different modalities of single-cell data and accurately annotating cell types using reference, which we consider as the step of “intermediate data analysis” in this thesis, serve as a foundation for the downstream analysis to uncover biological insights from single-cell data. This thesis proposed several statistical modelling and machine learning methods to address several challenges in intermediate data analysis in the single-cell omics era, including: (1) scMerge to effectively integrate multiple collections of single-cell RNA-sequencing (scRNA-seq) datasets from a single modality; (2) scClassify to annotate cell types for scRNA-seq data by capitalising on the large collection of well-annotated scRNA-seq datasets; and (3) scJoint to integrate unpaired atlas-scale single-cell multi-omics data and transfer labels from scRNA-seq datasets to scATAC-seq data. We illustrate that the proposed methods enable a novel and scalable workflow to integratively analyse large-cohort single-cell data, demonstrating using a collection of single-cell multi-omics COVID-19 datasets. In summary, this thesis contributes to single-cell research by developing effective, integrative and scalable methods towards a more comprehensive understanding of cellular phenotypes at single-cell resolution.
APA, Harvard, Vancouver, ISO, and other styles
14

Eichner, Johannes [Verfasser]. "Machine learning and statistical methods for preclinical omics data analysis / Johannes Eichner." München : Verlag Dr. Hut, 2015. http://d-nb.info/1079768874/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Barcelona, Cabeza Rosa. "Genomics tools in the cloud: the new frontier in omics data analysis." Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/672757.

Full text
Abstract:
Substantial technological advancements in next generation sequencing (NGS) have revolutionized the genomics field. Over the last years, the speed and throughput of NGS technologies have increased while their costs have decreased, allowing us to achieve base-by-base interrogation of the human genome in an efficient and affordable way. All these advances have led to a growing application of NGS technologies in clinical practice to identify genomic variations and their relationship with certain diseases. However, there is still a need to improve data accessibility, processing and interpretation, due to both the huge amount of data generated by these sequencing technologies and the large number of tools available to process it. Beyond the large number of algorithms available for variant discovery, each type of variation and data requires a specific algorithm. Therefore, a solid background in bioinformatics is required both to select the most suitable algorithm in each case and to execute it successfully. On that basis, the aim of this project is to facilitate the processing of sequencing data for variant identification and interpretation by non-bioinformaticians. This is achieved by creating high-performance workflows with a strong scientific basis that remain accessible and easy to use, together with a simple and highly intuitive platform for data interpretation. An exhaustive bibliographic review has been carried out in which the best existing algorithms have been selected to create automatic pipelines for the discovery of germline short variants (SNPs and indels) and germline structural variants (SVs), including both CNVs and chromosomal rearrangements, from modern human DNA. In addition to creating variant discovery pipelines, a pipeline has been implemented for in silico optimization of CNV detection from WES and TS data (isoCNV). This optimization pipeline has been shown to increase the sensitivity of CNV discovery using only NGS data. Such increased sensitivity is especially important for diagnosis in clinical settings. Furthermore, a variant discovery workflow has been developed by integrating WES and RNA-seq data (varRED), which has been shown to increase the number of variants identified compared with using WES data alone. It is important to note that variant discovery is not only important for modern populations; the study of variation in ancient genomes is also essential to understand past human evolution. Thus, a germline short variant discovery pipeline for ancient WGS samples has been implemented. This workflow has been applied to a human mandible dated to between 16980 and 16510 calibrated years before the present. The ancient short variants discovered were reported without further interpretation due to the low sample coverage. Finally, GINO has been implemented to facilitate the interpretation of the variants identified by the workflows developed in the context of this thesis. GINO is an easy-to-use platform for the visualization and interpretation of germline variants, available under user license. With the development of this thesis, it has been possible to implement the necessary tools for high-performance identification of all types of germline variants, as well as a powerful platform to interpret the identified variants in a simple and fast way. Using this platform allows non-bioinformaticians to focus on interpreting results without having to worry about data processing, with the guarantee of scientifically sound results.
Furthermore, it has laid the foundations for implementing a platform for comprehensive analysis and visualization of genomic data in the cloud in the near future.
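As an aside, a deliberately minimal sketch of how standard tools (bwa-mem, samtools, GATK HaplotypeCaller) can be chained for germline SNP/indel calling is shown below. It is not the thesis's pipeline: the file names are placeholders, and steps a production workflow needs (read groups, duplicate marking, BQSR, joint genotyping) are omitted:

```python
# Minimal sketch of a germline short-variant calling chain using standard tools.
# File names are placeholders; the BAM is assumed to already carry proper read groups.
import subprocess

reference = "reference.fasta"
reads_1, reads_2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"
bam, vcf = "sample.sorted.bam", "sample.vcf.gz"

# Align and coordinate-sort
align = subprocess.Popen(["bwa", "mem", "-t", "4", reference, reads_1, reads_2],
                         stdout=subprocess.PIPE)
subprocess.run(["samtools", "sort", "-o", bam, "-"], stdin=align.stdout, check=True)
align.stdout.close()
align.wait()
subprocess.run(["samtools", "index", bam], check=True)

# Call germline short variants (SNPs and indels)
subprocess.run(["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", vcf],
               check=True)
```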
APA, Harvard, Vancouver, ISO, and other styles
16

Wolf, Beat [Verfasser], and Thomas [Gutachter] Dandekar. "Reducing the complexity of OMICS data analysis / Beat Wolf ; Gutachter: Thomas Dandekar." Würzburg : Universität Würzburg, 2017. http://d-nb.info/1142114295/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Schurmann, Claudia [Verfasser]. "Analysis and Integration of Complex Omics Data of the SHIP Study / Claudia Schurmann." Greifswald : Universitätsbibliothek Greifswald, 2013. http://d-nb.info/1042077789/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Hernández de Diego, Rafael. "Development of bioinformatics resources for the integrative analysis of Next Generation omics data." Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/91227.

Full text
Abstract:
In recent years, increasingly efficient high-throughput sequencing techniques and the technological development accompanying them have favoured the development and popularisation of a new range of genomic research disciplines, collectively known as the omics. These technologies are capable of simultaneously measuring thousands of molecules which are essential for life, including DNA, RNA, proteins, and metabolites. Historically, classical genomic research has followed a reductionist approach by studying the structure, regulation, and function of these biological units independently. However, despite being a powerful analytical tool, the reductionist method cannot explain many of the biological phenomena that take place in living systems. This is because these biological events are not represented by the sum of their components; rather, only the interacting dynamics of the different omics elements can explain their complexity. In recent years Systems Biology has established itself as a multidisciplinary area of research which tries to model the dynamic behaviour of biological systems by holistically studying the interactions between the different omics disciplines; it combines simultaneous measurements of different types of molecules and integrates multiple sources of information in order to identify components that change in a coordinated way under controlled study conditions. Thus, Systems Biology is an interdisciplinary area that requires biologists, mathematicians, biochemists, and other researchers to work closely together, and in which computer science plays a fundamental role because of the volume and complexity of the data handled. This thesis addresses the problem of data management, integration, and analysis in multi-omics studies. More specifically, this research focused on two of the most characteristic computational challenges in Systems Biology: the development of integrated databases and the problem of integrative visualisation. Therefore, the first part of this work was devoted to designing and creating a bioinformatics resource for managing multi-omics experiments. The resulting platform, known as STATegra EMS, offers a complete set of tools that facilitate the storage and organisation of the large datasets generated during omics experiments, and also provides tools for data annotation in the later stages of processing and analysis of the information. The development of this platform required overcoming problems created by the heterogeneity, volume, and high variability of the data. Thus, as part of the solution to these problems, detailed metadata can be recorded within STATegra EMS, allowing dataset discrimination and successful data integration. To aid this process, the platform also offers a collaborative and easy-to-use web interface that combines modern web technologies and well-known community standards to represent the different components of the integrated experiments. The second part of this thesis examines the current situation and challenges in integrative data visualisation in multi-omics experiments, and presents the PaintOmics 3 web tool which was developed to address these issues. Since the capacity of the human brain for visual processing is highly evolved, integrative visualisation combined with data analysis techniques is probably one of the most powerful tools for interpreting and validating results in Systems Biology.
PaintOmics 3 provides a comprehensive framework for performing biological function enrichment analyses in experiments with multiple conditions and data types; it combines powerful tools for integrative data visualisation on KEGG molecular-interaction diagrams, biological-process interaction-networks, and statistical analyses. Moreover, unlike similar tools, PaintOmics 3 is interactive and easy to use, and stands out for its flexibility and the variety of omics data types it accepts, which include epigenomics data based on genomic regions, proteomics data, and miRNA-study data.
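As an illustrative aside on the kind of pathway-enrichment test such tools perform, here is a hypergeometric (one-sided Fisher) test with made-up numbers; it illustrates the general technique only, not PaintOmics 3's internals:

```python
# Sketch of a pathway-enrichment test: a hypergeometric test asking whether a pathway
# contains more significant genes than expected by chance. Numbers are illustrative.
from scipy import stats

total_genes = 20000          # background universe
pathway_genes = 150          # genes annotated to the pathway
significant_genes = 800      # genes flagged as changed in the experiment
overlap = 18                 # significant genes that fall in the pathway

# P(X >= overlap) under the hypergeometric null
p_enrich = stats.hypergeom.sf(overlap - 1, total_genes, pathway_genes, significant_genes)
print(f"enrichment p-value = {p_enrich:.3g}")
```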
Hernández De Diego, R. (2017). Development of bioinformatics resources for the integrative analysis of Next Generation omics data [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/91227
APA, Harvard, Vancouver, ISO, and other styles
19

Cao, Yingying [Verfasser], and Daniel [Akademischer Betreuer] Hoffmann. "Computational analysis and interpretation of multi-omics data / Yingying Cao ; Betreuer: Daniel Hoffmann." Duisburg, 2021. http://d-nb.info/1234911124/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Faria Do Valle, Italo <1990>. "New Approaches for the Molecular Profiling of Human Cancers through Omics Data Analysis." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amsdottorato.unibo.it/7997/1/FariaDoValle_Italo_tesi.pdf.

Full text
Abstract:
In this thesis, we present three studies in which we applied ad hoc computational methods for the molecular profiling of human cancers using omics data. In the first study our main goal was to develop a pipeline of analysis able to detect a wide range of single nucleotide mutations with high validation rates. We combined two standard tools to create the GATK-LODN method, and we applied our pipeline to exome sequencing data of hematological and solid tumors. We created simulated datasets and performed experimental validation to test the pipeline's sensitivity and specificity. In the second study we characterized the gene expression profiles of 11 tumor types, aiming at the discovery of multi-tumor drug targets and new strategies for drug combination and repurposing. We clustered tumors and applied a network-based analysis to integrate gene expression and protein interaction information. We defined three multi-tumor gene signatures, characterized by the following categories: NF-KB signaling, chromosomal instability, ubiquitin-proteasome system, DNA metabolism, and apoptosis. We evaluated the gene signatures based on mutational, pharmacological and clinical evidence. Moreover, we defined new pharmacological strategies validated by in vitro experiments that showed inhibition of cell growth in two tumor cell lines. In the third study we evaluated thyroid gene expression profiles of normal, Papillary Thyroid Carcinoma (PTC) and Anaplastic Thyroid Carcinoma (ATC) samples. The samples grouped along a progression trend according to tissue type; the main biological processes affected in the normal-to-PTC transition were related to extracellular matrix and cell morphology, and those affected in the PTC-to-ATC transition were related to the control of the cell cycle. We defined signatures related to each step of tumor progression and mapped the signatures onto protein-protein interaction and transcriptional regulatory networks to prioritize genes for subsequent experimental validation.
APA, Harvard, Vancouver, ISO, and other styles
21

Fatai, Azeez Ayomide. "Computational analysis of multilevel omics data for the elucidation of molecular mechanisms of cancer." University of the Western Cape, 2015. http://hdl.handle.net/11394/4782.

Full text
Abstract:
Philosophiae Doctor - PhD
Cancer is a group of diseases that arises from irreversible genomic and epigenomic alterations that result in unrestrained proliferation of abnormal cells. Detailed understanding of the molecular mechanisms underlying a cancer would aid the identification of most, if not all, genes responsible for its progression and the development of molecularly targeted chemotherapy. The challenge of recurrence after treatment shows that our understanding of cancer mechanisms is still poor. As a contribution to overcoming this challenge, we provide an integrative multi-omic analysis of glioblastoma multiforme (GBM), for which large data sets on different classes of genomic and epigenomic alterations have been made available in the Cancer Genome Atlas data portal. The first part of this study involves protein network analysis for the elucidation of GBM tumourigenic molecular mechanisms, identification of driver genes, prioritization of genes in chromosomal regions with copy number alteration, and co-expression and transcriptional analysis. Functional modules were obtained by edge-betweenness clustering of a protein network constructed from genes with predicted functional impact mutations and differentially expressed genes. Pathway enrichment analysis was performed on each module to identify statistical overrepresentation of signaling pathways. Known and novel candidate cancer driver genes were identified in the modules, and functionally relevant genes in chromosomal regions altered by homozygous deletion or high-level amplification were prioritized with the protein network. Co-expressed modules enriched in cancer biological processes and transcription factor targets were identified using network genes that demonstrated high expression variance. Our findings show that GBM's molecular mechanisms are much more complex than those reported in previous studies. We next identified differentially expressed miRNAs for which target genes associated with the protein network were also differentially expressed. MiRNAs and target genes were prioritized based on the number of targeted genes and targeting miRNAs, respectively. MiRNAs that correlated with time to progression were selected by an elastic net-penalized Cox regression model for survival analysis. These miRNAs were combined into a signature that independently predicted adjuvant therapy-linked progression-free survival in GBM and its subtypes and overall survival in GBM. The results show that miRNAs play significant roles in GBM progression and patients' survival. Finally, a prognostic mRNA signature that independently predicted progression-free and overall survival was identified. Pathway enrichment analysis was carried out on genes with high expression variance across a cohort to identify those in chemoradioresistance-associated pathways. A support vector machine-based method was then used to identify a set of genes that discriminated between rapidly- and slowly-progressing GBM patients, with a minimal 5% cross-validation error rate. The prognostic value of the gene set was demonstrated by its ability to predict adjuvant therapy-linked progression-free and overall survival in GBM and its subtypes, and was validated in an independent data set. We have identified a set of genes involved in tumourigenic mechanisms that could potentially be exploited as targets in drug development for the treatment of primary and recurrent GBM.
Furthermore, given their demonstrated accuracy in this study, the identified miRNA and mRNA signatures have strong potential to be combined and developed into a robust clinical test for predicting prognosis and treatment response.
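To make the cross-validated SVM classification step concrete, the following is a minimal sketch (not the author's actual pipeline) using scikit-learn; the expression matrix X and the progression labels y are hypothetical toy inputs.

```python
# Minimal sketch: discriminate rapidly- vs slowly-progressing patients from
# expression data with a linear SVM and cross-validation (toy data, not the thesis pipeline).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))          # hypothetical expression matrix (samples x genes)
y = rng.integers(0, 2, size=60)         # hypothetical labels: 0 = slow, 1 = rapid progression

model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=50),       # keep a small discriminative gene set
    SVC(kernel="linear", C=1.0),
)
scores = cross_val_score(model, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
print("cross-validation error rate:", 1 - scores.mean())
```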
APA, Harvard, Vancouver, ISO, and other styles
22

Zuo, Yiming. "Differential Network Analysis based on Omic Data for Cancer Biomarker Discovery." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/78217.

Full text
Abstract:
Recent advances in high-throughput techniques enable the generation of large amounts of omic data such as genomics, transcriptomics, proteomics, metabolomics, glycomics, etc. Typically, differential expression analysis (e.g., Student's t-test, ANOVA) is performed to identify biomolecules (e.g., genes, proteins, metabolites, glycans) with significant changes at the individual level between biologically disparate groups (disease cases vs. healthy controls) for cancer biomarker discovery. However, differential expression analyses of independent studies on the same clinical types of patients often lead to different sets of significant biomolecules with only a few in common. This may be attributed to the fact that biomolecules are members of strongly intertwined biological pathways and highly interactive with each other. Without considering these interactions, differential expression analysis can lead to biased results. Network-based methods provide a natural framework to study the interactions between biomolecules. Commonly used data-driven network models include relevance networks, Bayesian networks and Gaussian graphical models. In addition to data-driven network models, there are many publicly available databases such as STRING, KEGG, Reactome, and ConsensusPathDB, from which one can extract various types of interactions to build knowledge-driven networks. While both data- and knowledge-driven networks have their pros and cons, an appropriate approach to incorporate prior biological knowledge from publicly available databases into data-driven network models is desirable for more robust and biologically relevant network reconstruction. Recently, there has been growing interest in differential network analysis, where a connection in the network represents a statistically significant change in the pairwise interaction between two biomolecules in different groups. From the rewiring interactions shown in differential networks, biomolecules that have strongly altered connectivity between distinct biological groups can be identified. These biomolecules might play an important role in the disease under study. In fact, differential expression and differential network analyses investigate omic data from two complementary perspectives: the former focuses on changes at the individual biomolecule level between groups, while the latter concentrates on changes at the level of pairwise interactions. Therefore, an approach that can integrate differential expression and differential network analyses is likely to discover more reliable and powerful biomarkers. To achieve these goals, we start by proposing a novel data-driven network model (i.e., LOPC) to reconstruct sparse biological networks. The sparse networks contain only direct interactions between biomolecules, which helps researchers focus on the more informative connections. Then we propose a novel method (i.e., dwgLASSO) to incorporate prior biological knowledge into the data-driven network model to build biologically relevant networks. Differential network analysis is applied to the networks constructed for biologically disparate groups to identify cancer biomarker candidates. Finally, we propose a novel network-based approach (i.e., INDEED) to integrate differential expression and differential network analyses and identify more reliable and powerful cancer biomarker candidates.
INDEED is further expanded as INDEED-M to utilize omic data at different levels of the human biological system (e.g., transcriptomics, proteomics, metabolomics), which we believe is promising for increasing our understanding of cancer. Matlab and R packages for the proposed methods have been developed and are available on GitHub (https://github.com/Hurricaner1989) to share with the research community.
Ph. D.
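As an illustration of the general differential network idea described in this abstract (not the LOPC, dwgLASSO or INDEED implementations themselves), the following sketch estimates a sparse partial-correlation network per group with the graphical lasso and ranks biomolecules by how much their connectivity changes; all inputs are simulated.

```python
# Minimal sketch of differential network analysis: fit a sparse Gaussian graphical
# model per group, convert precision matrices to partial correlations, and score
# each biomolecule by its total rewiring between groups.
import numpy as np
from sklearn.covariance import GraphicalLasso

def partial_corr(precision):
    d = np.sqrt(np.diag(precision))
    pc = -precision / np.outer(d, d)
    np.fill_diagonal(pc, 0.0)
    return pc

rng = np.random.default_rng(1)
X_case = rng.normal(size=(100, 20))     # toy omic matrix, disease group
X_ctrl = rng.normal(size=(100, 20))     # toy omic matrix, control group

pc_case = partial_corr(GraphicalLasso(alpha=0.1).fit(X_case).precision_)
pc_ctrl = partial_corr(GraphicalLasso(alpha=0.1).fit(X_ctrl).precision_)

rewiring = np.abs(pc_case - pc_ctrl).sum(axis=1)   # differential connectivity score
candidates = np.argsort(rewiring)[::-1][:5]        # most rewired biomolecules
print(candidates)
```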
APA, Harvard, Vancouver, ISO, and other styles
23

Monteiro, Martins Sara [Verfasser]. "Bioinformatics analysis of multi-omics data elucidates U2 snRNP function in transcription / Sara Monteiro Martins." Göttingen : Niedersächsische Staats- und Universitätsbibliothek Göttingen, 2021. http://d-nb.info/1239894643/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Bockmayr, Michael [Verfasser]. "Integrative analysis of "omics" data and histopathological features in breast and ovarian cancer / Michael Bockmayr." Berlin : Medizinische Fakultät Charité - Universitätsmedizin Berlin, 2017. http://d-nb.info/1126504262/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Elhezzani, Najla Saad R. "New statistical methodologies for improved analysis of genomic and omic data." Thesis, King's College London (University of London), 2018. https://kclpure.kcl.ac.uk/portal/en/theses/new-statistical-methodologies-for-improved-analysis-of-genomic-and-omic-data(eb8d95f4-e926-4c54-984f-94d86306525a).html.

Full text
Abstract:
We develop statistical tools for analyzing different types of phenotypic data in genome-wide settings. When the phenotype of interest is a binary case-control status, most genome-wide association studies (GWASs) use randomly selected samples from the population (hereafter bases) as the control set. This approach is successful when the trait of interest is very rare; otherwise, a loss in the statistical power to detect disease-associated variants is expected. To address this, we propose a joint analysis of the three types of samples: cases, bases and controls. This is done by modeling the bases as a mixture of multinomial logistic functions of cases and controls, according to disease prevalence. In a typical GWAS, where thousands of single-nucleotide polymorphisms (SNPs) are available for testing, score-based test statistics are ideal. Other tests of association such as Wald's and the likelihood ratio test are known to be asymptotically equivalent to the score test; however, their performance under small sample sizes can vary significantly. In order to allow this test comparison to be performed under the proposed case-base-control (CBC) design, we provide an estimation procedure using the maximum likelihood (ML) method along with the expectation-maximization (EM) algorithm. Simulations show that combining the three samples can increase the power to detect disease-associated variants, though a very large base sample set can compensate for the lack of controls. In the second part of the thesis, we consider a joint analysis of both genome-wide SNPs and multiple phenotypes, with a focus on the challenges they present in the estimation of SNP heritability. The current standard for performing this task is fitting a variance component model, despite its tendency to produce boundary estimates when small sample sizes are used. We propose a Bayesian covariance component model (BCCM) that takes into account genetic correlation among phenotypes and genetic correlation among individuals. The use of Bayesian methods allows us to circumvent some issues related to small sample sizes, mainly overfitting and boundary estimates. Using gene expression pathways, we demonstrate a significant improvement in SNP heritability estimates over univariate and ML-based methods, thus explaining why recent progress in eQTL identification has been limited. I published this work as an article in the European Journal of Human Genetics. In the third part of the thesis, we study the prospects of using the proposed BCCM for phenotype prediction. Results from real data show consistency in accuracy between ML-based methods and the proposed Bayesian method when effect sizes are estimated using their posterior mode. It is also noted that an initial imputation step increases the predictive accuracy.
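The following toy sketch conveys the flavour of treating bases as a prevalence-weighted mixture of cases and controls with an EM algorithm; it is a strong simplification (a single SNP under Hardy-Weinberg equilibrium, no covariates) and not the CBC model of the thesis.

```python
# Toy EM: each "base" sample is a case with prior probability equal to the assumed
# disease prevalence K; case/control allele frequencies at one SNP are estimated
# from cases, controls and prevalence-weighted bases.
import numpy as np

K = 0.10                                   # assumed disease prevalence
rng = np.random.default_rng(2)
g_case = rng.binomial(2, 0.35, size=200)   # genotypes (0/1/2) in cases
g_ctrl = rng.binomial(2, 0.25, size=200)   # genotypes in controls
g_base = rng.binomial(2, 0.26, size=2000)  # genotypes in population bases

def geno_probs(freq):                      # Hardy-Weinberg genotype probabilities
    return np.array([(1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2])

p_case, p_ctrl = 0.3, 0.3                  # initial allele frequencies
for _ in range(100):
    # E-step: posterior probability that each base individual is a case
    lik_case = geno_probs(p_case)[g_base] * K
    lik_ctrl = geno_probs(p_ctrl)[g_base] * (1 - K)
    w = lik_case / (lik_case + lik_ctrl)
    # M-step: weighted allele-frequency updates using cases, controls and bases
    p_case = (g_case.sum() + (w * g_base).sum()) / (2 * (len(g_case) + w.sum()))
    p_ctrl = (g_ctrl.sum() + ((1 - w) * g_base).sum()) / (2 * (len(g_ctrl) + (1 - w).sum()))
print(round(p_case, 3), round(p_ctrl, 3))
```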
APA, Harvard, Vancouver, ISO, and other styles
26

Hafez, Khafaga Ahmed Ibrahem 1987. "Bioinformatics approaches for integration and analysis of fungal omics data oriented to knowledge discovery and diagnosis." Doctoral thesis, TDX (Tesis Doctorals en Xarxa), 2021. http://hdl.handle.net/10803/671160.

Full text
Abstract:
Aquesta tesi presenta una sèrie de recursos bioinformàtics desenvolupats per a donar suport en l'anàlisi de dades de NGS i altres òmics en el camp d'estudi i diagnòstic d'infeccions fúngiques. Hem dissenyat tècniques de computació per identificar nous biomarcadors i determinar potencial trets de resistència, pronosticant les característiques de les seqüències d'ADN/ARN, i planejant estratègies optimitzades de seqüenciació per als estudis de hoste-patogen transcriptomes (Dual RNA-seq). Hem dissenyat i desenvolupat tambe una solució bioinformàtica composta per un component de costat de servidor (constituït per diferents pipelines per a fer anàlisi VariantSeq, Denovoseq i RNAseq) i un altre component constituït per eines software basades en interfícies gràfiques (GUIs) per permetre a l'usuari accedir, gestionar i executar els pipelines mitjançant interfícies amistoses. També hem desenvolupat i validat un software per a l'anàlisi de seqüències i el disseny dels primers (SeqEditor) orientat a la identificació i detecció d'espècies en el diagnòstic de la PCR. Finalment, hem desenvolupat CandidaMine una base de dades integrant dades omiques de fongs patògens.
The aim of this thesis has been to develop a series of bioinformatic resources for the analysis of NGS, proteomics and other omics data in the field of study and diagnosis of yeast infections. In particular, we have explored and designed distinct computational techniques to identify novel biomarker candidates of resistance traits, to predict DNA/RNA sequence features, and to optimize sequencing strategies for host-pathogen transcriptome sequencing studies (Dual RNA-seq). We have designed and developed an efficient bioinformatic solution composed of a server-side component, constituted by distinct pipelines for VariantSeq, Denovoseq and RNAseq analyses, and another component constituted by GUI-based software that lets the user access, manage and run the pipelines through user-friendly interfaces. We have also designed and developed SeqEditor, a software tool for sequence analysis and primer design for species identification and detection in PCR diagnosis. Finally, we have developed CandidaMine, an integrated data warehouse of fungal omics data for analysis and knowledge discovery.
APA, Harvard, Vancouver, ISO, and other styles
27

Tsai, Tsung-Heng. "Bayesian Alignment Model for Analysis of LC-MS-based Omic Data." Diss., Virginia Tech, 2014. http://hdl.handle.net/10919/64151.

Full text
Abstract:
Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in various omic studies for biomarker discovery. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time alignment is one of the most important yet challenging preprocessing steps, in order to ensure that ion intensity measurements among multiple LC-MS runs are comparable. In this dissertation, we propose a Bayesian alignment model (BAM) for analysis of LC-MS data. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and provides estimates of the retention time variability along with uncertainty measures, enabling a natural framework to integrate information of various sources. From methodology development to practical application, we investigate the alignment problem through three research topics: 1) development of single-profile Bayesian alignment model, 2) development of multi-profile Bayesian alignment model, and 3) application to biomarker discovery research. Chapter 2 introduces the profile-based Bayesian alignment using a single chromatogram, e.g., base peak chromatogram from each LC-MS run. The single-profile alignment model improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler using a block Metropolis-Hastings algorithm, and 2) an adaptive mechanism for knot specification using stochastic search variable selection (SSVS). Chapter 3 extends the model to integrate complementary information that better captures the variability in chromatographic separation. We use Gaussian process regression on the internal standards to derive a prior distribution for the mapping functions. In addition, a clustering approach is proposed to identify multiple representative chromatograms for each LC-MS run. With the Gaussian process prior, these chromatograms are simultaneously considered in the profile-based alignment, which greatly improves the model estimation and facilitates the subsequent peak matching process. Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to biomarker discovery research. We integrate the proposed Bayesian alignment model into a rigorous preprocessing pipeline for LC-MS data analysis. Through the developed analysis pipeline, candidate biomarkers for hepatocellular carcinoma (HCC) are identified and confirmed on a complementary platform.
Ph. D.
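A minimal illustration of Metropolis-Hastings inference for a retention-time shift between two chromatograms is sketched below; BAM itself models smooth mapping functions with block updates, SSVS-based knot selection and Gaussian-process priors, none of which is reproduced here. All signals are simulated.

```python
# Toy random-walk Metropolis-Hastings: infer a single retention-time shift between
# a reference chromatogram and a shifted, noisy run.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 10, 500)
reference = np.exp(-(t - 4.0) ** 2) + 0.5 * np.exp(-(t - 7.0) ** 2 / 0.5)
observed = np.exp(-(t - 4.6) ** 2) + 0.5 * np.exp(-(t - 7.6) ** 2 / 0.5)
observed += rng.normal(scale=0.02, size=t.size)         # true shift = 0.6

def log_post(shift, sigma=0.02):
    pred = np.interp(t, t + shift, reference)            # reference shifted by `shift`
    return -0.5 * np.sum((observed - pred) ** 2) / sigma ** 2 - 0.5 * shift ** 2

shift, samples = 0.0, []
for _ in range(5000):
    prop = shift + rng.normal(scale=0.05)                # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(shift):
        shift = prop
    samples.append(shift)
post = np.array(samples[1000:])                          # discard burn-in
print("posterior mean shift:", post.mean(), "+/-", post.std())
```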
APA, Harvard, Vancouver, ISO, and other styles
28

Ruffalo, Matthew M. "Algorithms for Constructing Features for Integrated Analysis of Disparate Omic Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=case1449238712.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Li, Yichao. "Algorithmic Methods for Multi-Omics Biomarker Discovery." Ohio University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1541609328071533.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Berti, Elisa. "Applicazione del metodo QDanet_PRO alla classificazione di dati omici." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amslaurea.unibo.it/9411/.

Full text
Abstract:
This thesis work concerns data analysis using a method (QDanet_PRO), developed by Prof. Remondini in collaboration with Dr. Levi and Dr. Malagoli, based on pairwise discriminant analysis and Network Theory, whose goal is the classification of data in datasets where the number of samples is very small compared to the number of variables. The aim of this study is to identify signatures, i.e., small subsets of variables able to correctly classify the samples on the basis of the behaviour of the variables themselves. Each dataset is processed in several steps: the analysis starts with a pairwise discriminant analysis to assess the performance of every pair of variables, and then searches for the best-performing pairs through a procedure that combines Network Theory with cross-validation. Once the signature is obtained, the analysis ends with a validation step that provides a quantitative assessment of the success of the method.
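A minimal sketch of the pairwise discriminant idea is given below; it scores every variable pair by cross-validated discriminant accuracy and keeps the union of the best pairs, whereas QDanet_PRO additionally couples the pair performances with a network analysis. Data are simulated.

```python
# Toy pairwise discriminant screening: rank variable pairs by cross-validated
# quadratic discriminant accuracy and take the union of the top pairs as a signature.
import itertools
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 30))             # few samples, many variables (toy)
y = rng.integers(0, 2, size=40)

scores = {}
for i, j in itertools.combinations(range(X.shape[1]), 2):
    clf = QuadraticDiscriminantAnalysis()
    scores[(i, j)] = cross_val_score(clf, X[:, [i, j]], y, cv=5).mean()

top_pairs = sorted(scores, key=scores.get, reverse=True)[:10]
signature = sorted({v for pair in top_pairs for v in pair})
print("candidate signature:", signature)
```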
APA, Harvard, Vancouver, ISO, and other styles
31

Bylesjö, Max. "Latent variable based computational methods for applications in life sciences : Analysis and integration of omics data sets." Doctoral thesis, Umeå universitet, Kemi, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-1616.

Full text
Abstract:
With the increasing availability of high-throughput systems for parallel monitoring of multiple variables, e.g. levels of large numbers of transcripts in functional genomics experiments, massive amounts of data are being collected even from single experiments. Extracting useful information from such systems is a non-trivial task that requires powerful computational methods to identify common trends and to help detect the underlying biological patterns. This thesis deals with the general computational problems of classifying and integrating high-dimensional empirical data using a latent variable based modeling approach. The underlying principle of this approach is that a complex system can be characterized by a few independent components that characterize the systematic properties of the system. Such a strategy is well suited for handling noisy, multivariate data sets with strong multicollinearity structures, such as those typically encountered in many biological and chemical applications. The main foci of the studies this thesis is based upon are applications and extensions of the orthogonal projections to latent structures (OPLS) method in life science contexts. OPLS is a latent variable based regression method that separately describes systematic sources of variation that are related and unrelated to the modeling aim (for instance, classifying two different categories of samples). This separation of sources of variation can be used to pre-process data, but also has distinct advantages for model interpretation, as exemplified throughout the work. For classification cases, a probabilistic framework for OPLS has been developed that allows the incorporation of both variance and covariance into classification decisions. This can be seen as a unification of two historical classification paradigms based on either variance or covariance. In addition, a non-linear reformulation of the OPLS algorithm is outlined, which is useful for particularly complex regression or classification tasks. The general trend in functional genomics studies in the post-genomics era is to perform increasingly comprehensive characterizations of organisms in order to study the associations between their molecular and cellular components in greater detail. Frequently, abundances of all transcripts, proteins and metabolites are measured simultaneously in an organism at a current state or over time. In this work, a generalization of OPLS is described for the analysis of multiple data sets. It is shown that this method can be used to integrate data in functional genomics experiments by separating the systematic variation that is common to all data sets considered from sources of variation that are specific to each data set.
Funktionsgenomik är ett forskningsområde med det slutgiltiga målet att karakterisera alla gener i ett genom hos en organism. Detta inkluderar studier av hur DNA transkriberas till mRNA, hur det sedan translateras till proteiner och hur dessa proteiner interagerar och påverkar organismens biokemiska processer. Den traditionella ansatsen har varit att studera funktionen, regleringen och translateringen av en gen i taget. Ny teknik inom fältet har dock möjliggjort studier av hur tusentals transkript, proteiner och små molekyler uppträder gemensamt i en organism vid ett givet tillfälle eller över tid. Konkret innebär detta även att stora mängder data genereras även från små, isolerade experiment. Att hitta globala trender och att utvinna användbar information från liknande data-mängder är ett icke-trivialt beräkningsmässigt problem som kräver avancerade och tolkningsbara matematiska modeller. Denna avhandling beskriver utvecklingen och tillämpningen av olika beräkningsmässiga metoder för att klassificera och integrera stora mängder empiriskt (uppmätt) data. Gemensamt för alla metoder är att de baseras på latenta variabler: variabler som inte uppmätts direkt utan som beräknats från andra, observerade variabler. Detta koncept är väl anpassat till studier av komplexa system som kan beskrivas av ett fåtal, oberoende faktorer som karakteriserar de huvudsakliga egenskaperna hos systemet, vilket är kännetecknande för många kemiska och biologiska system. Metoderna som beskrivs i avhandlingen är generella men i huvudsak utvecklade för och tillämpade på data från biologiska experiment. I avhandlingen demonstreras hur dessa metoder kan användas för att hitta komplexa samband mellan uppmätt data och andra faktorer av intresse, utan att förlora de egenskaper hos metoden som är kritiska för att tolka resultaten. Metoderna tillämpas för att hitta gemensamma och unika egenskaper hos regleringen av transkript och hur dessa påverkas av och påverkar små molekyler i trädet poppel. Utöver detta beskrivs ett större experiment i poppel där relationen mellan nivåer av transkript, proteiner och små molekyler undersöks med de utvecklade metoderna.
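To make the separation of predictive and Y-orthogonal variation concrete, here is a minimal single-orthogonal-component filter in the spirit of OPLS (a sketch under simplifying assumptions, not the published OPLS/O2PLS implementations); X and y are simulated.

```python
# Minimal OPLS-style filter: remove one component of variation in X that is
# orthogonal to the response y before fitting a predictive component.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 200))
y = rng.normal(size=50)
Xc = X - X.mean(axis=0)                     # column-centre X
yc = y - y.mean()

w = Xc.T @ yc
w /= np.linalg.norm(w)                      # predictive weight vector
t = Xc @ w                                  # predictive score
p = Xc.T @ t / (t @ t)                      # predictive loading

w_orth = p - (w @ p) * w                    # part of the loading orthogonal to w
w_orth /= np.linalg.norm(w_orth)
t_orth = Xc @ w_orth                        # Y-orthogonal score
p_orth = Xc.T @ t_orth / (t_orth @ t_orth)  # Y-orthogonal loading

X_filtered = Xc - np.outer(t_orth, p_orth)  # X with orthogonal variation removed
print(X_filtered.shape, float(t_orth @ yc)) # orthogonal score is uncorrelated with y
```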
APA, Harvard, Vancouver, ISO, and other styles
32

Bylesjö, Max. "Latent variable based computational methods for applications in life sciences : Analysis and integration of omics data sets /." Umeå : Chemistry Kemi, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-1616.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Ronen, Jonathan. "Integrative analysis of data from multiple experiments." Doctoral thesis, Humboldt-Universität zu Berlin, 2020. http://dx.doi.org/10.18452/21612.

Full text
Abstract:
Auf die Entwicklung der Hochdurchsatz-Sequenzierung (HTS) folgte eine Reihe von speziellen Erweiterungen, die erlauben verschiedene zellbiologischer Aspekte wie Genexpression, DNA-Methylierung, etc. zu messen. Die Analyse dieser Daten erfordert die Entwicklung von Algorithmen, die einzelne Experimenteberücksichtigen oder mehrere Datenquellen gleichzeitig in betracht nehmen. Der letztere Ansatz bietet besondere Vorteile bei Analyse von einzelligen RNA-Sequenzierung (scRNA-seq) Experimenten welche von besonders hohem technischen Rauschen, etwa durch den Verlust an Molekülen durch die Behandlung geringer Ausgangsmengen, gekennzeichnet sind. Um diese experimentellen Defizite auszugleichen, habe ich eine Methode namens netSmooth entwickelt, welche die scRNA-seq-Daten entrascht und fehlende Werte mittels Netzwerkdiffusion über ein Gennetzwerk imputiert. Das Gennetzwerk reflektiert dabei erwartete Koexpressionsmuster von Genen. Unter Verwendung eines Gennetzwerks, das aus Protein-Protein-Interaktionen aufgebaut ist, zeige ich, dass netSmooth anderen hochmodernen scRNA-Seq-Imputationsmethoden bei der Identifizierung von Blutzelltypen in der Hämatopoese, zur Aufklärung von Zeitreihendaten unter Verwendung eines embryonalen Entwicklungsdatensatzes und für die Identifizierung von Tumoren der Herkunft für scRNA-Seq von Glioblastomen überlegen ist. netSmooth hat einen freien Parameter, die Diffusionsdistanz, welche durch datengesteuerte Metriken optimiert werden kann. So kann netSmooth auch dann eingesetzt werden, wenn der optimale Diffusionsabstand nicht explizit mit Hilfe von externen Referenzdaten optimiert werden kann. Eine integrierte Analyse ist auch relevant wenn multi-omics Daten von mehrerer Omics-Protokolle auf den gleichen biologischen Proben erhoben wurden. Hierbei erklärt jeder einzelne dieser Datensätze nur einen Teil des zellulären Systems, während die gemeinsame Analyse ein vollständigeres Bild ergibt. Ich entwickelte eine Methode namens maui, um eine latente Faktordarstellungen von multiomics Daten zu finden.
The development of high throughput sequencing (HTS) was followed by a swarm of protocols utilizing HTS to measure different molecular aspects such as gene expression (transcriptome), DNA methylation (methylome) and more. This opened opportunities for the development of data analysis algorithms and procedures that consider data produced by different experiments. Considering data from seemingly unrelated experiments is particularly beneficial for single-cell RNA sequencing (scRNA-seq). scRNA-seq produces particularly noisy data, due to loss of nucleic acids when handling the small amounts in single cells, and various technical biases. To address these challenges, I developed a method called netSmooth, which de-noises and imputes scRNA-seq data by applying network diffusion over a gene network which encodes expectations of co-expression patterns. The gene network is constructed from other experimental data. Using a gene network constructed from protein-protein interactions, I show that netSmooth outperforms other state-of-the-art scRNA-seq imputation methods at the identification of blood cell types in hematopoiesis, as well as elucidation of time series data in an embryonic development dataset, and identification of tumor of origin for scRNA-seq of glioblastomas. netSmooth has a free parameter, the diffusion distance, which I show can be selected using data-driven metrics. Thus, netSmooth may be used even in cases when the diffusion distance cannot be optimized explicitly using ground-truth labels. Another task which requires in-tandem analysis of data from different experiments arises when different omics protocols are applied to the same biological samples. Analyzing such multiomics data in an integrated fashion, rather than each data type (RNA-seq, DNA-seq, etc.) on its own, is beneficial, as each omics experiment only elucidates part of an integrated cellular system. The simultaneous analysis may reveal a comprehensive view. To this end, I developed a method called maui to learn latent factor representations of multi-omics data.
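A minimal sketch of network-diffusion smoothing in the spirit of netSmooth is shown below; the real package works on protein-protein interaction networks and selects the restart parameter with data-driven metrics, whereas here the network, the counts and the parameter are all toy assumptions.

```python
# Toy network-diffusion smoothing: propagate noisy expression over a gene graph
# with a random walk with restart, S <- (1 - alpha) * W^T S + alpha * E.
import numpy as np

rng = np.random.default_rng(6)
n_genes, n_cells = 50, 20
expr = rng.poisson(1.0, size=(n_genes, n_cells)).astype(float)   # noisy counts
A = (rng.uniform(size=(n_genes, n_genes)) < 0.05).astype(float)  # toy gene network
A = np.triu(A, 1); A = A + A.T

deg = A.sum(axis=1)
deg[deg == 0] = 1.0
W = A / deg[:, None]                       # row-normalised adjacency
alpha = 0.5                                # restart / smoothing parameter

smoothed = expr.copy()
for _ in range(50):                        # iterate to (approximate) convergence
    smoothed = (1 - alpha) * W.T @ smoothed + alpha * expr
print(smoothed.shape)
```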
APA, Harvard, Vancouver, ISO, and other styles
34

Gomari, Daniel Parviz [Verfasser], Jan [Akademischer Betreuer] Krumsiek, Karsten [Gutachter] Suhre, and Jan [Gutachter] Krumsiek. "Novel network-based methods for multi-omics data analysis and interpretation / Daniel Parviz Gomari ; Gutachter: Karsten Suhre, Jan Krumsiek ; Betreuer: Jan Krumsiek." München : Universitätsbibliothek der TU München, 2021. http://d-nb.info/1235664775/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Meng, Chen [Verfasser], Bernhard [Akademischer Betreuer] Küster, and Dmitrij [Akademischer Betreuer] Frischmann. "Application of multivariate methods to the integrative analysis of high-throughput omics data / Chen Meng. Betreuer: Bernhard Küster. Gutachter: Bernhard Küster ; Dmitrij Frischmann." München : Universitätsbibliothek der TU München, 2016. http://d-nb.info/1082347299/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Denecker, Thomas. "Bioinformatique et analyse de données multiomiques : principes et applications chez les levures pathogènes Candida glabrata et Candida albicans Functional networks of co-expressed genes to explore iron homeostasis processes in the pathogenic yeast Candida glabrata Efficient, quick and easy-to-use DNA replication timing analysis with START-R suite FAIR_Bioinfo: a turnkey training course and protocol for reproducible computational biology Label-free quantitative proteomics in Candida yeast species: technical and biological replicates to assess data reproducibility Rendre ses projets R plus accessibles grâce à Shiny Pixel: a content management platform for quantitative omics data Empowering the detection of ChIP-seq "basic peaks" (bPeaks) in small eukaryotic genomes with a web user-interactive interface A hypothesis-driven approach identifies CDK4 and CDK6 inhibitors as candidate drugs for treatments of adrenocortical carcinomas Characterization of the replication timing program of 6 human model cell lines." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASL010.

Full text
Abstract:
Plusieurs évolutions sont constatées dans la recherche en biologie. Tout d’abord, les études menées reposent souvent sur des approches expérimentales quantitatives. L’analyse et l’interprétation des résultats requièrent l’utilisation de l’informatique et des statistiques. Également, en complément des études centrées sur des objets biologiques isolés, les technologies expérimentales haut débit permettent l’étude des systèmes (caractérisation des composants du système ainsi que des interactions entre ces composants). De très grandes quantités de données sont disponibles dans les bases de données publiques, librement réutilisables pour de nouvelles problématiques. Enfin, les données utiles pour les recherches en biologie sont très hétérogènes (données numériques, de textes, images, séquences biologiques, etc.) et conservées sur des supports d’information également très hétérogènes (papiers ou numériques). Ainsi « l’analyse de données » s’est petit à petit imposée comme une problématique de recherche à part entière et en seulement une dizaine d’années, le domaine de la « Bioinformatique » s’est en conséquence totalement réinventé. Disposer d’une grande quantité de données pour répondre à un questionnement biologique n’est souvent pas le défi principal. La vraie difficulté est la capacité des chercheurs à convertir les données en information, puis en connaissance. Dans ce contexte, plusieurs problématiques de recherche en biologie ont été abordées lors de cette thèse. La première concerne l’étude de l’homéostasie du fer chez la levure pathogène Candida glabrata. La seconde concerne l’étude systématique des modifications post-traductionnelles des protéines chez la levure pathogène Candida albicans. Pour ces deux projets, des données « omiques » ont été exploitées : transcriptomiques et protéomiques. Des outils bioinformatiques et des outils d’analyses ont été implémentés en parallèle conduisant à l’émergence de nouvelles hypothèses de recherche en biologie. Une attention particulière et constante a aussi été portée sur les problématiques de reproductibilité et de partage des résultats avec la communauté scientifique
Biological research is changing. First, studies are often based on quantitative experimental approaches, and the analysis and interpretation of the results therefore require computer science and statistics. Also, alongside studies focused on isolated biological objects, high-throughput experimental technologies make it possible to capture the functioning of biological systems (identification of components as well as of the interactions between them). Very large amounts of data are also available in public databases, freely reusable to address new open questions. Finally, the data used in biological research are heterogeneous (numerical data, texts, images, biological sequences, etc.) and stored on equally heterogeneous media (paper or digital). Thus, "data analysis" has gradually emerged as a key research issue, and in only ten years the field of "Bioinformatics" has changed significantly. Having a large amount of data to answer a biological question is often not the main challenge; the real challenge is the ability of researchers to convert the data into information and then into knowledge. In this context, several biological research projects were addressed in this thesis. The first concerns the study of iron homeostasis in the pathogenic yeast Candida glabrata. The second concerns the systematic investigation of post-translational modifications of proteins in the pathogenic yeast Candida albicans. In both projects, omics data were exploited: transcriptomics and proteomics. Appropriate bioinformatics and analysis tools were developed, leading to the emergence of new research hypotheses. Particular and constant attention has also been paid to the questions of data reproducibility and of sharing results with the scientific community.
APA, Harvard, Vancouver, ISO, and other styles
37

Fonseca, Renata Santana. "Modelos de sobreviv?ncia com fra??o de cura e omiss?o nas covari?veis." Universidade Federal do Rio Grande do Norte, 2009. http://repositorio.ufrn.br:8080/jspui/handle/123456789/17004.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
In this work we study the survival cure rate model proposed by Yakovlev (1993), which is formulated in a competing risks setting. Covariates are introduced for modeling the cure rate, and we allow some covariates to have missing values. We consider only the cases in which the missing covariates are categorical, and implement the EM algorithm via the method of weights for maximum likelihood estimation. We present a Monte Carlo simulation experiment to compare the properties of the estimators based on this method with those of the estimators obtained under the complete-case scenario. In this experiment we also evaluate the impact on the parameter estimates of increasing the proportion of immune individuals and of censoring among the non-immune ones. We illustrate the proposed methodology with a real data set on the time until graduation for the undergraduate Statistics programme of the Universidade Federal do Rio Grande do Norte.
Neste trabalho estudamos o modelo de sobrevivência com fração de cura proposto por Yakovlev et al. (1993), que possui uma estrutura de riscos competitivos. Covariáveis são introduzidas para modelar o número médio de riscos e permitimos que algumas destas covariáveis apresentem omissão. Consideramos apenas os casos em que as covariáveis omissas são categóricas e as estimativas dos parâmetros são obtidas através do algoritmo EM ponderado. Apresentamos uma série de simulações para confrontar as estimativas obtidas através deste método com as obtidas quando se exclui do banco de dados as observações que apresentam omissão, conhecida como análise de casos completos. Avaliamos também, através de simulações, o impacto na estimativa dos parâmetros quando aumenta-se o percentual de curados e de censura entre indivíduos não curados. Um conjunto de dados reais referentes ao tempo até a conclusão do curso de estatística na Universidade Federal do Rio Grande do Norte é utilizado para ilustrar o método.
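For illustration, a minimal promotion-time (Yakovlev-type) cure model fitted by maximum likelihood on complete simulated data is sketched below; the weighted EM for missing categorical covariates, which is the actual contribution of the thesis, is not reproduced.

```python
# Toy promotion-time cure model: S(t|x) = exp(-theta(x) F(t)), with a Poisson
# number of latent competing risks (0 risks = cured), exponential latent times,
# and theta depending on one binary covariate x. All names are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, rate_true = 500, 0.5
x = rng.integers(0, 2, size=n).astype(float)
theta = np.exp(0.3 + 0.8 * x)                    # mean number of latent risks
n_risks = rng.poisson(theta)                     # individuals with 0 risks are cured
t_event = np.full(n, np.inf)
pos = n_risks > 0
t_event[pos] = rng.exponential(1.0 / (rate_true * n_risks[pos]))  # min of N exponential times
censor = rng.uniform(1, 10, size=n)
time, delta = np.minimum(t_event, censor), (t_event <= censor).astype(float)

def neg_loglik(par):
    b0, b1, log_rate = par
    th, rate = np.exp(b0 + b1 * x), np.exp(log_rate)
    F = 1 - np.exp(-rate * time)                 # exponential CDF of latent times
    logf = np.log(rate) - rate * time            # exponential log-density
    return -np.sum(delta * (np.log(th) + logf) - th * F)

fit = minimize(neg_loglik, x0=np.zeros(3), method="Nelder-Mead")
print(fit.x)                                     # estimates of (b0, b1, log rate)
```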
APA, Harvard, Vancouver, ISO, and other styles
38

Voillet, Valentin. "Approche intégrative du développement musculaire afin de décrire le processus de maturation en lien avec la survie néonatale." Thesis, Toulouse, INPT, 2016. http://www.theses.fr/2016INPT0067/document.

Full text
Abstract:
Depuis plusieurs années, des projets d'intégration de données omiques se sont développés, notamment avec objectif de participer à la description fine de caractères complexes d'intérêt socio-économique. Dans ce contexte, l'objectif de cette thèse est de combiner différentes données omiques hétérogènes afin de mieux décrire et comprendre le dernier tiers de gestation chez le porc, période influençant la mortinatalité porcine. Durant cette thèse, nous avons identifié les bases moléculaires et cellulaires sous-jacentes de la fin de gestation, en particulier au niveau du muscle squelettique. Ce tissu est en effet déterminant à la naissance car impliqué dans l'efficacité de plusieurs fonctions physiologiques comme la thermorégulation et la capacité à se déplacer. Au niveau du plan expérimental, les tissus analysés proviennent de foetus prélevés à 90 et 110 jours de gestation (naissance à 114 jours), issus de deux lignées extrêmes pour la mortalité à la naissance, Large White et Meishan, et des deux croisements réciproques. Au travers l'application de plusieurs études statistiques et computationnelles (analyses multidimensionnelles, inférence de réseaux, clustering et intégration de données), nous avons montré l'existence de mécanismes biologiques régulant la maturité musculaire chez les porcelets, mais également chez d'autres espèces d'intérêt agronomique (bovin et mouton). Quelques gènes et protéines ont été identifiées comme étant fortement liées à la mise en place du métabolisme énergétique musculaire durant le dernier tiers de gestation. Les porcelets ayant une immaturité du métabolisme musculaire seraient sujets à un plus fort risque de mortalité à la naissance. Un second volet de cette thèse concerne l'imputation de données manquantes (tout un groupe de variables pour un individu) dans les méthodes d'analyses multidimensionnelles, comme l'analyse factorielle multiple (AFM) (ou multiple factor analysis (MFA)). Dans notre contexte, l'AFM fut particulièrement intéressante pour l'intégration de données d'un ensemble d'individus sur différents tissus (deux ou plus). Afin de conserver ces individus manquants pour tout un groupe de variables, nous avons développé une méthode, appelée MI-MFA (multiple imputation - MFA), permettant l'estimation des composantes de l'AFM pour ces individus manquants
Over the last decades, omics data integration studies have been developed to contribute to the detailed description of complex traits of socio-economic interest. In this context, the aim of the thesis is to combine heterogeneous omics data to better describe and understand the last third of gestation in pigs, a period influencing piglet mortality at birth. In the thesis, we better defined the molecular and cellular bases underlying the end of gestation, with a focus on skeletal muscle. This tissue is especially involved in the efficiency of several physiological functions, such as thermoregulation and motor functions. According to the experimental design, tissues were collected at two gestational ages (90 or 110 days of gestation, with birth at 114 days) from four fetal genotypes. These genotypes consisted of two breeds with extreme mortality at birth (Meishan and Large White) and the two reciprocal crosses. Through statistical and computational analyses (descriptive analyses, network inference, clustering and biological data integration), we highlighted biological mechanisms regulating the maturation process in pigs, but also in other livestock species (cattle and sheep). Some genes and proteins were identified as being highly involved in muscle energy metabolism. Piglets with an immature muscle metabolism appear to be associated with a higher risk of mortality at birth. A second aspect of the thesis was the imputation of missing individual row values in the framework of multidimensional statistical methods, such as multiple factor analysis (MFA). In our context, MFA was particularly interesting for integrating data coming from the same individuals on different tissues (two or more). To retain individuals missing an entire group of variables, we developed a method called MI-MFA (multiple imputation - MFA), allowing the estimation of the MFA components for these missing individuals.
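A minimal sketch of the MFA core on which MI-MFA builds is given below: each omic block is scaled by its first singular value before a global PCA, so that no single block dominates the common components. The multiple-imputation step for individuals missing an entire block is not reproduced; both data blocks are simulated.

```python
# Minimal MFA-like analysis: weight each block by its first singular value, then
# run a global PCA on the concatenated, weighted blocks.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(9)
transcriptome = rng.normal(size=(30, 300))        # toy block 1 (same 30 individuals)
proteome = rng.normal(size=(30, 80))              # toy block 2

def weight_block(block):
    centred = (block - block.mean(axis=0)) / block.std(axis=0)
    s1 = np.linalg.svd(centred, compute_uv=False)[0]   # first singular value
    return centred / s1

global_table = np.hstack([weight_block(transcriptome), weight_block(proteome)])
components = PCA(n_components=2).fit_transform(global_table)
print(components.shape)                            # common low-dimensional coordinates
```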
APA, Harvard, Vancouver, ISO, and other styles
39

Czerwińska, Urszula. "Unsupervised deconvolution of bulk omics profiles : methodology and application to characterize the immune landscape in tumors Determining the optimal number of independent components for reproducible transcriptomic data analysis Application of independent component analysis to tumor transcriptomes reveals specific and reproducible immune-related signals A multiscale signalling network map of innate immune response in cancer reveals signatures of cell heterogeneity and functional polarization." Thesis, Sorbonne Paris Cité, 2018. http://www.theses.fr/2018USPCB075.

Full text
Abstract:
Les tumeurs sont entourées d'un microenvironnement complexe comprenant des cellules tumorales, des fibroblastes et une diversité de cellules immunitaires. Avec le développement actuel des immunothérapies, la compréhension de la composition du microenvironnement tumoral est d'une importance critique pour effectuer un pronostic sur la progression tumorale et sa réponse au traitement. Cependant, nous manquons d'approches quantitatives fiables et validées pour caractériser le microenvironnement tumoral, facilitant ainsi le choix de la meilleure thérapie. Une partie de ce défi consiste à quantifier la composition cellulaire d'un échantillon tumoral (appelé problème de déconvolution dans ce contexte), en utilisant son profil omique de masse (le profil quantitatif global de certains types de molécules, tels que l'ARNm ou les marqueurs épigénétiques). La plupart des méthodes existantes utilisent des signatures prédéfinies de types cellulaires et ensuite extrapolent cette information à des nouveaux contextes. Cela peut introduire un biais dans la quantification de microenvironnement tumoral dans les situations où le contexte étudié est significativement différent de la référence. Sous certaines conditions, il est possible de séparer des mélanges de signaux complexes, en utilisant des méthodes de séparation de sources et de réduction des dimensions, sans définitions de sources préexistantes. Si une telle approche (déconvolution non supervisée) peut être appliquée à des profils omiques de masse de tumeurs, cela permettrait d'éviter les biais contextuels mentionnés précédemment et fournirait un aperçu des signatures cellulaires spécifiques au contexte. Dans ce travail, j'ai développé une nouvelle méthode appelée DeconICA (Déconvolution de données omiques de masse par l'analyse en composantes immunitaires), basée sur la méthodologie de séparation aveugle de source. DeconICA a pour but l'interprétation et la quantification des signaux biologiques, façonnant les profils omiques d'échantillons tumoraux ou de tissus normaux, en mettant l'accent sur les signaux liés au système immunitaire et la découverte de nouvelles signatures. Afin de rendre mon travail plus accessible, j'ai implémenté la méthode DeconICA en tant que librairie R. En appliquant ce logiciel aux jeux de données de référence, j'ai démontré qu'il est possible de quantifier les cellules immunitaires avec une précision comparable aux méthodes de pointe publiées, sans définir a priori des gènes spécifiques au type cellulaire. DeconICA peut fonctionner avec des techniques de factorisation matricielle telles que l'analyse indépendante des composants (ICA) ou la factorisation matricielle non négative (NMF). Enfin, j'ai appliqué DeconICA à un grand volume de données : plus de 100 jeux de données, contenant au total plus de 28 000 échantillons de 40 types de tumeurs, générés par différentes technologies et traités indépendamment. Cette analyse a démontré que les signaux immunitaires basés sur l'ICA sont reproductibles entre les différents jeux de données. D'autre part, nous avons montré que les trois principaux types de cellules immunitaires, à savoir les lymphocytes T, les lymphocytes B et les cellules myéloïdes, peuvent y être identifiés et quantifiés. Enfin, les métagènes dérivés de l'ICA, c'est-à-dire les valeurs de projection associées à une source, ont été utilisés comme des signatures spécifiques permettant d'étudier les caractéristiques des cellules immunitaires dans différents types de tumeurs. 
L'analyse a révélé une grande diversité de phénotypes cellulaires identifiés ainsi que la plasticité des cellules immunitaires, qu'elle soit dépendante ou indépendante du type de tumeur. Ces résultats pourraient être utilisés pour identifier des cibles médicamenteuses ou des biomarqueurs pour l'immunothérapie du cancer
Tumors are engulfed in a complex microenvironment (TME) including tumor cells, fibroblasts, and a diversity of immune cells. Currently, a new generation of cancer therapies based on modulation of the immune system response is in active clinical development, with first promising results. Therefore, understanding the composition of the TME in each tumor case is critically important to make a prognosis on tumor progression and its response to treatment. However, we lack reliable and validated quantitative approaches to characterize the TME in order to facilitate the choice of the best existing therapy. One part of this challenge is to be able to quantify the cellular composition of a tumor sample (called the deconvolution problem in this context) using its bulk omic profile (global quantitative profiling of certain types of molecules, such as mRNA or epigenetic markers). In recent years, there has been a remarkable explosion in the number of methods approaching this problem in several different ways. Most of them use pre-defined molecular signatures of specific cell types and extrapolate this information to previously unseen contexts. This can bias the TME quantification in those situations where the context under study is significantly different from the reference. In theory, under certain assumptions, it is possible to separate complex signal mixtures using classical and advanced methods of source separation and dimension reduction, without pre-existing source definitions. If such an approach (unsupervised deconvolution) is feasible for bulk omic profiles of tumor samples, then this would make it possible to avoid the above-mentioned contextual biases and provide insights into the context-specific signatures of cell types. In this work, I developed a new method called DeconICA (Deconvolution of bulk omics datasets through Immune Component Analysis), based on the blind source separation methodology. DeconICA aims to decipher and quantify the biological signals shaping omics profiles of tumor samples or normal tissues. A particular focus of my study was on immune system-related signals and on discovering new signatures of immune cell types. In order to make my work more accessible, I implemented the DeconICA method as an R package named "DeconICA". By applying this software to standard benchmark datasets, I demonstrated that DeconICA is able to quantify immune cells with accuracy comparable to published state-of-the-art methods, but without defining cell type-specific signature genes a priori. The implementation can work with existing deconvolution methods based on matrix factorization techniques such as Independent Component Analysis (ICA) or Non-Negative Matrix Factorization (NMF). Finally, I applied DeconICA to a large corpus of data containing more than 100 transcriptomic datasets composed of, in total, over 28,000 samples of 40 tumor types, generated by different technologies and processed independently. This analysis demonstrated that ICA-based immune signals are reproducible between datasets and that three major immune cell types (T cells, B cells and myeloid cells) can be reliably identified and quantified. Additionally, I used the ICA-derived metagenes as context-specific signatures in order to study the characteristics of immune cells in different tumor types. The analysis revealed a large diversity and plasticity of immune cells, both dependent on and independent of tumor type.
Some conclusions of the study can be helpful in the identification of new drug targets or biomarkers for cancer immunotherapy.
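As a rough illustration of ICA-based deconvolution (not the DeconICA package itself, which adds reproducibility analysis across datasets and curated immune metagenes), the sketch below decomposes a simulated bulk expression matrix with FastICA and scores each component against a hypothetical list of immune marker genes.

```python
# Toy ICA-based deconvolution: decompose a bulk expression matrix (genes x samples)
# into independent components and score each component's gene weights against a
# hypothetical set of immune marker genes.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(10)
n_genes, n_samples = 1000, 120
expr = rng.normal(size=(n_genes, n_samples))          # toy bulk transcriptomes
immune_markers = np.arange(50)                        # hypothetical marker gene rows

ica = FastICA(n_components=10, random_state=0, max_iter=1000)
S = ica.fit_transform(expr)                           # gene-by-component weights (metagenes)

for k in range(S.shape[1]):
    w = np.abs(S[:, k])
    enrichment = w[immune_markers].mean() / w.mean()  # crude marker enrichment score
    print(f"component {k}: marker enrichment {enrichment:.2f}")

# Per-sample abundance proxies for a component k can then be read from the mixing
# matrix: ica.mixing_[:, k] gives one value per sample.
```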
APA, Harvard, Vancouver, ISO, and other styles
40

Abily-Donval, Lénaïg. "Exploration des mécanismes physiopathologiques des mucopolysacharidoses et de la maladie de Fabry par approches "omiques" et modulation de l'autophagie. Urinary metabolic phenotyping of mucopolysaccharidosis type I combining untargeted and targeted strategies with data modeling Unveiling metabolic remodeling in mucopolysaccharidosis type III through integrative metabolomics and pathway analysis." Thesis, Normandie, 2019. http://www.theses.fr/2019NORMR108.

Full text
Abstract:
Les pathologies lysosomales sont des maladies liées au déficit quantitatif ou qualitatif d’une hydrolase ou d’un transporteur à l’origine d’une atteinte multiviscérale potentiellement sévère. Certaines de ces pathologies sont accessibles à des traitements mais ces thérapeutiques sont uniquement symptomatiques et ne guérissent pas les patients. Même si le phénomène de surcharge peut expliquer entre autres la symptomatologie observée, la physiopathologie de ces maladies est complexe et non précisément connue. Une meilleure connaissance de ces pathologies pourrait permettre d’améliorer leur prise en charge globale. L’objectif de ce travail était dans un premier temps d’appliquer des techniques « omiques » dans deux groupes de maladies : les mucopolysaccharidoses et la maladie de Fabry. Cette étude a permis la mise en place d’une méthodologie métabolomique non ciblée basée sur une stratégie analytique multidimensionnelle comportant la spectrométrie de masse à haute résolution couplée à la chromatographie liquide ultra-haute performance et la mobilité ionique. Dans les mucopolysaccharidoses, l’étude des voies métaboliques a mis en évidence des modifications dans le métabolisme de plusieurs acides aminés et du système oxydatif du glutathion. Dans la maladie de Fabry, des modifications ont été observées dans l’expression de l’interleukine 7 et du facteur de croissance FGF2. La deuxième partie du travail s’est intéressée à la modulation de l’autophagie dans la maladie de Fabry. Notre étude a montré une diminution du flux autophagique avec un retard d’adressage de l’enzyme au lysosome dans les cellules Fabry. L’inhibition de l’autophagie permet de diminuer l’accumulation du substrat accumulé (Gb3) et améliore l’efficacité de l’enzymothérapie substitutive. En conclusion ce travail a permis une meilleure compréhension des mécanismes physiopathologiques des pathologies lysosomales et a montré la complexité du fonctionnement du lysosome. Ces données permettent d’espérer l’amélioration des stratégies thérapeutiques et diagnostiques dans ces maladies
Lysosomal diseases, caused by quantitative or qualitative defects of a hydrolase or a transporter, induce multiorgan involvement. Some specific symptomatic treatments are available, but they do not cure patients. The pathophysiological bases of lysosomal diseases are poorly understood and cannot be explained by storage alone. A better knowledge of these pathologies could improve their management. The first aim of this study was to apply "omics" strategies to mucopolysaccharidoses and Fabry disease. This thesis allowed the implementation of an untargeted metabolomic methodology based on a multidimensional analytical strategy including high-resolution mass spectrometry coupled with ultra-high-performance liquid chromatography and ion mobility. In the mucopolysaccharidoses, analysis of metabolic pathways showed a major remodeling of amino acid metabolism as well as oxidative stress via glutathione metabolism. In Fabry disease, changes were observed in the expression of interleukin 7 and FGF2. The second study focused on the modulation of autophagy in Fabry disease. In this work, we have shown a disruption of the autophagic process and a delay in enzyme targeting to the lysosome in Fabry disease cells. Inhibition of autophagy reduced the accumulation of the stored substrate (Gb3) and improved the efficiency of enzyme replacement therapy. This work provided a better understanding of the physiopathological mechanisms implicated in lysosomal diseases and showed the complexity of the lysosome. These data could improve the management of these diseases and offer hope for patients.
APA, Harvard, Vancouver, ISO, and other styles
41

Hulot, Audrey. "Analyses de données omiques : clustering et inférence de réseaux Female ponderal index at birth and idiopathic infertility." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASL034.

Full text
Abstract:
Le développement des méthodes de biologie haut-débit (séquençage et spectrométrie de masse) a permis de générer de grandes masses de données, dites -omiques, qui nous aident à mieux comprendre les processus biologiques.Cependant, isolément, chaque source -omique ne permet d'expliquer que partiellement ces processus. Mettre en relation les différentes sources de donnés -omiques devrait permettre de mieux comprendre les processus biologiques mais constitue un défi considérable.Dans cette thèse, nous nous intéressons particulièrement aux méthodes de clustering et d’inférence de réseaux, appliquées aux données -omiques.La première partie du manuscrit présente trois méthodes. Les deux premières méthodes sont applicables dans un contexte où les données peuvent être de nature hétérogène.La première concerne un algorithme d’agrégation d’arbres, permettant la construction d’un clustering hiérarchique consensus. La complexité sous-quadratique de cette méthode a fait l’objet d’une démonstration, et permet son application dans un contexte de grande dimension. Cette méthode est disponible dans le package R mergeTrees, accessible sur le CRAN.La seconde méthode concerne l’intégration de données provenant d’arbres ou de réseaux, en transformant les objets via la distance cophénétique ou via le plus court chemin, en matrices de distances. Elle utilise le Multidimensional Scaling et l’Analyse Factorielle Multiple et peut servir à la construction d’arbres et de réseaux consensus.Enfin, dans une troisième méthode, on se place dans le contexte des modèles graphiques gaussiens, et cherchons à estimer un graphe, ainsi que des communautés d’entités, à partir de plusieurs tables de données. Cette méthode est basée sur la combinaison d’un Stochastic Block Model, un Latent block Model et du Graphical Lasso.Cette thèse présente en deuxième partie les résultats d’une étude de données transcriptomiques et métagénomiques, réalisée dans le cadre d’un projet appliqué, sur des données concernant la Spondylarthrite ankylosante
The development of biological high-throughput technologies (next-generation sequencing and mass spectrometry) has provided researchers with a large amount of data, also known as -omics, that help us better understand biological processes. However, each source of data taken separately explains only a very small part of a given process; linking the different -omics sources should help us understand more of these processes. In this manuscript, we focus on two approaches, clustering and network inference, applied to omics data. The first part of the manuscript presents three methodological developments on this topic. The first two methods are applicable in situations where the data are heterogeneous. The first method is an algorithm for aggregating trees, in order to create a consensus out of a set of hierarchical clusterings. The complexity of the procedure is sub-quadratic, allowing it to be applied to data leading to trees with a large number of leaves. This algorithm is available in an R package named mergeTrees on CRAN. The second method deals with the integration of data from trees and networks, by transforming these objects into distance matrices using the cophenetic and shortest-path distances, respectively. This method relies on Multidimensional Scaling and Multiple Factor Analysis and can also be used to build consensus trees or networks. Finally, we use the Gaussian Graphical Models setting and seek to estimate a graph, as well as communities in the graph, from several data tables. This method is based on a combination of the Stochastic Block Model, the Latent Block Model and the Graphical Lasso. The second part of the manuscript presents analyses conducted on transcriptomics and metagenomics data to identify targets and gain insight into the predisposition to Ankylosing Spondylitis.
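A minimal sketch of the tree-integration idea is shown below: several hierarchical clusterings are converted into cophenetic distance matrices, averaged, and then embedded and re-clustered. This is not the mergeTrees algorithm nor the full MDS/MFA pipeline of the thesis; all tables are simulated.

```python
# Toy consensus of hierarchical clusterings via cophenetic distances, followed by
# an MDS embedding and a consensus tree on the averaged distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(11)
omics = [rng.normal(size=(25, 100)) for _ in range(3)]   # 3 toy tables, same 25 samples

coph_mats = []
for table in omics:
    Z = linkage(table, method="average")                 # one tree per omic table
    coph_mats.append(squareform(cophenet(Z)))            # its cophenetic distance matrix

consensus = np.mean(coph_mats, axis=0)                   # simple consensus distance
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(consensus)    # common low-dimensional embedding
consensus_tree = linkage(squareform(consensus), method="average")
labels = fcluster(consensus_tree, t=3, criterion="maxclust")
print(coords.shape, labels)
```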
APA, Harvard, Vancouver, ISO, and other styles
42

Elmansy, Dalia F. "Computational Methods to Characterize the Etiology of Complex Diseases at Multiple Levels." Case Western Reserve University School of Graduate Studies / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=case1583416431321447.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Teng, Sin Yong. "Intelligent Energy-Savings and Process Improvement Strategies in Energy-Intensive Industries." Doctoral thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2020. http://www.nusl.cz/ntk/nusl-433427.

Full text
Abstract:
As new technologies for energy-intensive industries continue to be developed, existing plants gradually fall behind in efficiency and productivity. Harsh market competition and environmental legislation push these traditional plants towards the end of operation and shutdown. Process improvement and retrofit projects are essential for maintaining the operational performance of these plants. Current approaches to process improvement are mainly process integration, process optimization and process intensification. These fields generally rely on mathematical optimization, the practitioner's experience and operational heuristics, and they serve as the basis for process improvement; however, their performance can be further enhanced by modern computational intelligence. The purpose of this work is therefore to apply advanced artificial intelligence and machine learning techniques to process improvement in energy-intensive industrial processes. The thesis approaches this problem through the simulation of industrial systems and makes the following contributions: (i) application of machine learning techniques, including one-shot learning and neuro-evolution, for data-driven modelling and optimization of individual units; (ii) application of dimensionality reduction (e.g. principal component analysis, autoencoders) for multi-objective optimization of multi-unit processes; (iii) design of a new tool, bottleneck tree analysis (BOTA), for analysing problematic parts of a system in order to remove them, together with an extension that addresses multi-dimensional problems using a data-driven approach; (iv) demonstration of the effectiveness of Monte Carlo simulations, neural networks and decision trees for decision-making when integrating new process technology into existing processes; (v) comparison of Hierarchical Temporal Memory (HTM) and dual optimization with several predictive tools for supporting real-time operations management; (vi) implementation of an artificial neural network within an interface to the conventional process graph (P-graph); and (vii) an outlook on the future of artificial intelligence and process engineering in biosystems through a commercially based multi-omics paradigm.
APA, Harvard, Vancouver, ISO, and other styles
44

Wolf, Beat. "Reducing the complexity of OMICS data analysis." Doctoral thesis, 2017. https://nbn-resolving.org/urn:nbn:de:bvb:20-opus-153687.

Full text
Abstract:
The field of genetics faces many challenges and opportunities in both research and diagnostics due to the rise of next generation sequencing (NGS), a technology that allows DNA to be sequenced increasingly quickly and cheaply. NGS is used to analyze not only DNA but also RNA, a very similar molecule also present in the cell, in both cases producing large amounts of data. This volume of data raises both infrastructure and usability problems, as powerful computing infrastructures are required and the data analysis involves many manual steps that are complicated to execute. Both of these problems limit the use of NGS in the clinic and in research by creating a bottleneck both computationally and in terms of manpower, as for many analyses geneticists lack the required computing skills. Over the course of this thesis we investigated how computer science can help to improve this situation and reduce the complexity of this type of analysis. We looked at how to make the analysis more accessible in order to increase the number of people who can perform OMICS data analysis (OMICS groups together various genomics data sources). To approach this problem, we developed a graphical NGS data analysis pipeline aimed at a diagnostics environment, while still being useful in research, in close collaboration with the Human Genetics Department at the University of Würzburg. The pipeline has been used in various research papers, including works with direct author participation in genomics, transcriptomics and epigenomics. To further validate the graphical pipeline, a user survey was carried out, which confirmed that it lowers the complexity of OMICS data analysis. We also studied how the data analysis can be improved in terms of computing infrastructure by improving the performance of certain analysis steps. We did this both through speed improvements on a single computer (with variant calling, notably, becoming up to 18 times faster) and through distributed computing, to make better use of an existing infrastructure. The improvements were integrated into the previously described graphical pipeline, which itself was also designed for low resource usage. As a major contribution, and to help with the future development of parallel and distributed applications, in genetics or otherwise, we also looked at how to make it easier to develop such applications. Based on the parallel object programming model (POP), we created a Java language extension called POP-Java, which allows for easy and transparent distribution of objects. Through this development we brought the POP model to the cloud and to Hadoop clusters, and we present a new collaborative distributed computing model called FriendComputing. The advances made in the different domains of this thesis have been published in the works cited in this document.
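As a language-agnostic illustration of the parallelisation idea mentioned above (it is deliberately not POP-Java and not the thesis pipeline), one can split an analysis by genomic region and farm the regions out to worker processes; the region list and the per-region function here are placeholders.

```python
from multiprocessing import Pool

REGIONS = [f"chr{i}" for i in range(1, 23)]       # hypothetical work units

def process_region(region: str) -> tuple[str, int]:
    """Placeholder for a per-region analysis step (e.g. variant calling)."""
    return region, len(region)                    # dummy result

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = dict(pool.map(process_region, REGIONS))
    print(len(results), results["chr1"])
```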
APA, Harvard, Vancouver, ISO, and other styles
45

Benjamin, Ashlee Marie. "Computational Processing of Omics Data: Implications for Analysis." Diss., 2013. http://hdl.handle.net/10161/8217.

Full text
Abstract:

In this work, I present four studies across the range of 'omics data types - a Genome-Wide Association Study for gene-by-sex interaction of obesity traits, computational models for transcription start site classification, an assessment of reference-based mapping methods for RNA-Seq data from non-model organisms, and a statistical model for open-platform proteomics data alignment.

Obesity is an increasingly prevalent and severe health concern with a substantial heritable component and marked sex differences. We sought to determine if the effect of genetic variants also differed by sex by performing a genome-wide association study modeling the effect of genotype-by-sex interaction on obesity phenotypes. Genotype data from individuals in the Framingham Heart Study Offspring cohort were analyzed across five exams. Although no variants showed genome-wide significant gene-by-sex interaction in any individual exam, four polymorphisms displayed a consistent BMI association (P-values .00186 to .00010) across all five exams. These variants were clustered downstream of LYPLAL1, which encodes a lipase/esterase expressed in adipose tissue, a locus previously identified as having sex-specific effects on central obesity. Primary effects in males were in the opposite direction to those in females and were replicated in Framingham Generation 3. Our data support a sex-influenced association between genetic variation at the LYPLAL1 locus and obesity-related traits.
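The interaction model described in this study can be illustrated, on entirely synthetic data rather than the Framingham genotypes, with an ordinary least squares fit that includes a genotype-by-sex term; the variable names and effect sizes below are invented for the sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "genotype": rng.integers(0, 3, n),     # 0/1/2 minor-allele counts
    "sex": rng.integers(0, 2, n),          # 0 = female, 1 = male
    "age": rng.normal(50, 10, n),
})
# Simulated BMI with an opposite-direction genotype effect in males vs. females
df["bmi"] = (27 + 0.4 * df.genotype - 0.8 * df.genotype * df.sex
             + 0.02 * df.age + rng.normal(0, 3, n))

# genotype * sex expands to the main effects plus the interaction of interest
fit = smf.ols("bmi ~ genotype * sex + age", data=df).fit()
print(fit.params["genotype:sex"], fit.pvalues["genotype:sex"])
```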

The application of deep sequencing to map 5' capped transcripts has confirmed the existence of at least two distinct promoter classes in metazoans: focused promoters with transcription start sites (TSSs) that occur in a narrowly defined genomic span and dispersed promoters with TSSs that are spread over a larger window. Previous studies have explored the presence of genomic features, such as CpG islands and sequence motifs, in these promoter classes, and our collaborators recently investigated the relationship with chromatin features. It was found that promoter classes are significantly differentiated by nucleosome organization and chromatin structure. Here, we present computational models supporting the stronger contribution of chromatin features to the definition of dispersed promoters compared to focused start sites. Specifically, dispersed promoters display enrichment for well-positioned nucleosomes downstream of the TSS and a more clearly defined nucleosome free region upstream, while focused promoters have a less organized nucleosome structure, yet higher presence of RNA polymerase II. These differences extend to histone variants (H2A.Z) and marks (H3K4 methylation), as well as insulator binding (such as CTCF), independent of the expression levels of affected genes.

The application of next-generation sequencing technology to gene expression quantification analysis, namely, RNA-Sequencing, has transformed the way in which gene expression studies are conducted and analyzed. These advances are of particular interest to researchers studying non-model organisms, as the need for knowledge of sequence information is overcome. De novo assembly methods have gained widespread acceptance in the RNA-Seq community for non-model organisms with no true reference genome or transcriptome. While such methods have tremendous utility, computational complexity is still a significant challenge for organisms with large and complex genomes. Here we present a comparison of four reference-based mapping methods for non-human primate data. We explore mapping efficacy, correlation between computed expression values, and utility for differential expression analyses. We show that reference-based mapping methods indeed have utility in RNA-Seq analysis of mammalian data with no true reference, and that the details of mapping methods should be carefully considered when doing so. We find that shorter seed sequences, allowance of mismatches, and allowance of gapped alignments, in addition to splice junction gaps, result in more sensitive alignments of non-human primate RNA-Seq data.
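One of the comparisons mentioned above, the concordance of expression values computed by different mappers, reduces to a simple rank correlation; the toy per-gene counts below are invented.

```python
import numpy as np
from scipy.stats import spearmanr

counts_mapper_a = np.array([120, 5, 300, 46, 0, 88])   # hypothetical per-gene counts
counts_mapper_b = np.array([131, 7, 275, 50, 1, 95])
rho, p = spearmanr(counts_mapper_a, counts_mapper_b)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
```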

Open-platform proteomics experiments seek to quantify and identify the proteins present in biological samples. Much like differential gene expression analyses, it is often of interest to determine how protein abundance differs in various physiological conditions. Label-free LC-MS/MS enables the rapid measurement of thousands of proteins, providing a wealth of peptide intensity information for differential analysis. However, the processing of raw proteomics data poses significant challenges that must be overcome prior to analysis. We specifically address the matching of peptide measurements across samples - an essential pre-processing step in every proteomics experiment. Presented here is a novel method for open-platform proteomics data alignment with the ability to incorporate previously unused aspects of the data, particularly ion mobility drift times and product ion data. Our results suggest that the inclusion of additional data results in higher numbers of more confident matches, without increasing the number of mismatches. We also show that the incorporation of product ion data can improve results dramatically. Based on these results, we argue that the incorporation of ion mobility drift times and product ion information are worthy pursuits. In addition, alignment methods should be flexible enough to utilize all available data, particularly with recent advancements in experimental separation methods. The addition of drift times and/or high-energy product ion data to alignment methods and accurate mass and time (AMT) tag databases can greatly improve experimenters' ability to identify measured peptides, reducing analysis costs and potentially the need to run additional experiments.
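The alignment problem described here, matching peptide features across runs using m/z, retention time and ion-mobility drift time, can be sketched crudely with per-dimension tolerances; the feature lists, tolerances and matching rule below are made-up simplifications, not the thesis model.

```python
import numpy as np

# Columns: m/z, retention time (min), drift time (ms); two hypothetical runs
run_a = np.array([[500.25, 30.1, 25.0],
                  [712.40, 45.7, 31.2]])
run_b = np.array([[500.26, 30.4, 25.1],
                  [650.10, 20.0, 28.0],
                  [712.41, 45.2, 31.0]])

TOL = np.array([0.02, 1.0, 0.5])                 # per-dimension matching tolerances

matches = []
for i, feat in enumerate(run_a):
    diffs = np.abs(run_b - feat)
    within = np.all(diffs <= TOL, axis=1)        # candidates inside every tolerance
    if within.any():
        j = int(np.argmin(np.where(within, diffs[:, 0], np.inf)))  # closest m/z candidate
        matches.append((i, j))
print(matches)                                   # [(0, 0), (1, 2)]
```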


Dissertation
APA, Harvard, Vancouver, ISO, and other styles
46

"Sparse Models For Multimodal Imaging And Omics Data Integration." 2015.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
47

Nersisyan, Lilit. "Telomere analysis based on high-throughput multi-omics data." Doctoral thesis, 2017. https://ul.qucosa.de/id/qucosa%3A16297.

Full text
Abstract:
Telomeres are repeated sequences at the ends of eukaryotic chromosomes that play a prominent role in normal aging and disease development. They are dynamic structures that normally shorten over the lifespan of a cell, but can be elongated in cells with high proliferative capacity. Telomere elongation in stem cells is an advantageous mechanism that allows them to maintain the regenerative capacity of tissues; however, it also allows for the survival of cancer cells, thus leading to the development of malignancies. Numerous studies have been conducted to explore the role of telomeres in health and disease. However, the majority of these studies have focused on the consequences of extreme shortening of telomeres that lead to telomere dysfunction, replicative arrest or chromosomal instability. Very few studies have addressed the regulatory roles of telomeres and the association of genomic, transcriptomic and epigenomic characteristics of a cell with telomere length dynamics. The scarcity of such studies is partially conditioned by the low-throughput nature of experimental approaches for telomere length measurement and the fact that they do not easily integrate with currently available high-throughput data. In this thesis, we have attempted to build algorithms, in silico pipelines and software packages to utilize high-throughput -omics data for telomere biology research. First, we have developed a software package, Computel, to compute telomere length from whole genome next generation sequencing data. We show that it can be used to integrate telomere length dynamics into systems biology research. Using Computel, we have studied the association of telomere length with genomic variations in a healthy human population, as well as with transcriptomic and epigenomic features of lung cancers. Another aim of our study was to develop in silico models to assess the activity of telomere maintenance mechanisms (TMMs) based on gene expression data. There are two main TMMs: one based on the catalytic activity of the ribonucleoprotein complex telomerase, and the other based on recombination events between telomeric sequences. Which type of TMM gets activated in a cancer cell determines the aggressiveness of the tumor and the outcome of the disease. Investigation into TMMs is valuable not only for basic research, but also for applied medicine, since many anticancer therapies attempt to inhibit the TMM in cancer cells to stop their growth. Therefore, studying the activation mechanisms and regulators of TMMs is of paramount importance for understanding cancer pathomechanisms and for treatment. Many studies have addressed this topic; however, many aspects of TMM activation and realization still remain elusive. Additionally, current data-mining pipelines and functional annotation approaches for phenotype-associated genes are not adapted for the identification of TMMs. To overcome these limitations, we have constructed pathway networks for the two TMMs based on the literature, and have developed a methodology for assessment of TMM pathway activities from gene expression data. We have described the accuracy of our TMM-based approach on a set of cancer samples with experimentally validated TMMs. We have also applied it to explore TMM activity states in lung adenocarcinoma cell lines. In summary, recent developments in high-throughput technologies allow for the production of data on multiple levels of cellular organization - from genomic and transcriptomic to epigenomic. This has allowed for the rapid development of various directions in molecular and cellular biology. In contrast, telomere research, although at the heart of stem cell and cancer studies, is still conducted with low-throughput experimental approaches. Here, we have attempted to utilize the huge amount of currently accumulated multi-omics data to foster telomere research and to bring it to the systems biology scale.
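The read-based idea behind tools of this kind can be caricatured in a few lines: count reads dominated by the telomeric repeat and relate them to overall coverage. The filter, reads and normalisation below are deliberately naive stand-ins and not Computel's actual algorithm.

```python
TEL_REPEAT = "TTAGGG"

def looks_telomeric(read: str, min_repeats: int = 4) -> bool:
    """Crude filter: the read contains a run of at least `min_repeats` tandem repeats."""
    return (TEL_REPEAT * min_repeats) in read

reads = [                                   # toy reads; real input would be a FASTQ/BAM
    "TTAGGG" * 10,                          # telomeric
    "ACGTACGTACGTACGTACGTACGTACGTACGT",     # ordinary genomic
    "CCCTAA" * 10,                          # reverse-complement telomeric (ignored here)
]

tel_fraction = sum(looks_telomeric(r) for r in reads) / len(reads)
# Converting this fraction into a length in base pairs would additionally require
# read length and sequencing coverage, which full tools such as Computel account for.
print(tel_fraction)
```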
APA, Harvard, Vancouver, ISO, and other styles
48

Costa, João Carlos Sequeira. "Development of an automated pipeline for meta-omics data analysis." Master's thesis, 2017. http://hdl.handle.net/1822/56113.

Full text
Abstract:
Master's dissertation in Computer Science
Knowing what lies around us has been a goal for many decades now, and new advances in sequencing technologies and in meta-omics approaches have made it possible to start answering some of the main questions of microbiology - what is there, and what is it doing? The exponential growth of omics studies has been met by the development of bioinformatic tools capable of handling Metagenomics (MG) analysis, with only a scarce few integrating such analysis with Metatranscriptomics (MT) or Metaproteomics (MP) studies. Furthermore, the existing tools for meta-omics analysis are usually not user-friendly and are often limited to command-line usage. Because of the variety of meta-omics approaches, a standard workflow is not possible, but some routines exist which may be implemented in a single tool, thereby facilitating the work of laboratory professionals. In the framework of this master thesis, a pipeline for integrative MG and MT data analysis was developed. This pipeline aims to retrieve comprehensive comparative gene/transcript expression results obtained from different biological samples. At the end of each step, the user can access the data, summaries containing several parameters for evaluating that step, and final graphical representations, such as Krona plots and Differential Expression (DE) heatmaps. Several quality reports are also generated. The pipeline was constructed with tools tested and validated for meta-omics data analysis. Selected tools include FastQC, Trimmomatic and SortMeRNA for preprocessing, MetaSPAdes and Megahit for assembly, MetaQUAST and Bowtie2 for reporting on the quality of the assembly, FragGeneScan and DIAMOND for annotation, and DESeq2 for DE analysis. The tools were first tested separately and then integrated into several Python wrappers to construct the software Meta-Omics Software for Community Analysis (MOSCA). MOSCA performs preprocessing of MG and MT reads, assembly of the reads, annotation of the assembled contigs, and a final data analysis. Real datasets were used to test the capabilities of the tool. Since different types of files are produced along the workflow, it is possible to perform further analyses to obtain additional information and/or additional data representations, such as metabolic pathway mapping.
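As a rough, hypothetical sketch of how a wrapper-based pipeline of this kind chains external tools (the commands below are simplified placeholders rather than MOSCA's actual invocations, and the flags would need checking against each tool's documentation):

```python
import subprocess
from pathlib import Path

def run_step(name: str, cmd: list[str], logdir: Path) -> None:
    """Run one pipeline step and keep its combined stdout/stderr as a log."""
    logdir.mkdir(parents=True, exist_ok=True)
    with open(logdir / f"{name}.log", "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)

workdir = Path("meta_omics_run")                          # hypothetical working directory
(workdir / "qc").mkdir(parents=True, exist_ok=True)       # FastQC expects an existing output dir

steps = [
    ("quality_check", ["fastqc", "sample_R1.fastq", "--outdir", str(workdir / "qc")]),
    ("assembly", ["megahit", "-1", "sample_R1.fastq", "-2", "sample_R2.fastq",
                  "-o", str(workdir / "assembly")]),
]
for name, cmd in steps:
    run_step(name, cmd, workdir / "logs")
```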
APA, Harvard, Vancouver, ISO, and other styles
49

Choong, Wai-kok, and 鍾偉國. "A PPI-based GO functional enrichment analysis for “omics” data." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/86506525034712415580.

Full text
Abstract:
Master's thesis
National Yang-Ming University
Institute of Biomedical Informatics
Academic year 99
With the popularization of high-throughput technology, enrichment tools have been rapidly developed for analysing large-scale "omics" data. However, most methods emphasize statistical significance rather than biological considerations and have difficulty assigning correct statistical significance to terms with few entities. It is therefore difficult for researchers to arrive at an accurate biological interpretation and to assess the quality of Gene Ontology (GO) enrichment results. In this study, we introduce a new functional enrichment analysis strategy. It integrates: 1) comparative gene/protein quantification from experiments; 2) the evidence codes of GO annotations for quality control; and 3) the interaction relationships provided by STRING, in order to identify the GO terms with accurate biological interpretation. The output is expected to describe the experiments precisely. In addition, we provide several output styles with graphical visualization. The PPIs within terms, the DAG structure and the gene similarity between terms are used to cluster enriched GO terms. Applying our strategy to the p53 +/- status expression dataset, the enriched term with the highest score is GO:0010640 (platelet-derived growth factor receptor signaling pathway, F3 gene, F7 gene), which is supported by the literature. Since most of the top-ranked GO terms in the results are supported by previous studies, we believe that the genes or proteins in the enriched terms have potential as candidates for biomarker discovery or targets for experimental design.
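For orientation, the statistical core that most GO enrichment tools share (before adding the PPI- and evidence-code-based refinements this thesis proposes) is a hypergeometric test; the counts below are invented.

```python
from scipy.stats import hypergeom

population = 20000     # annotated genes in the background (hypothetical)
term_genes = 150       # genes annotated to one GO term
selected = 300         # genes of interest (e.g. differentially expressed)
overlap = 12           # genes of interest annotated to that term

# P(X >= overlap) when drawing `selected` genes without replacement
p_value = hypergeom.sf(overlap - 1, population, term_genes, selected)
print(f"enrichment p-value: {p_value:.3g}")
```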
APA, Harvard, Vancouver, ISO, and other styles
50

Huang, Chia-Yu, and 黃家郁. "A Hybrid Analysis Method for Construction of Heterogeneous Network from Multi-Omics Data." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/23261558416219789439.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Biomedical Electronics and Bioinformatics
Academic year 105
It has been shown that tumorigenesis is caused by an accumulation of perturbations across different layers of biomolecules. There has therefore been growing interest in the integrated analysis of multilayer omics data. The dimensionality, heterogeneity, and dependency of omics data necessitate an effective hybrid-analysis method for systematically exploring the associations and interactions between layers; no such method had previously been developed. In the present study, we aimed to develop a hybrid-analysis method that incorporates multi-omics data to systematically identify the omics features related to a specific outcome, such as drug responsiveness or patient prognosis. The identified features are then presented as a network, with a node representing a feature and an edge representing a correlation between features. In addition, the method can cluster a group of highly correlated omics features into a module, suggesting putative interactions between biomolecules. The proposed method can be briefly divided into four steps. First, omics data were collected and normalized to transform each dataset to an appropriate scale. Next, we preselected the features of interest to reduce the dimensionality. Third, the Least Absolute Shrinkage and Selection Operator (Lasso) estimator was introduced to identify representative nodes in each module. Finally, we built integral modules through correlation analyses. To test the feasibility of our method, a simulation study and two applications to public datasets, the Cancer Cell Line Encyclopedia (CCLE) and The Cancer Genome Atlas (TCGA), were conducted. The results of the simulation study demonstrated the feasibility of applying the Lasso estimator in the hybrid-analysis method and suggested that improved performance can be achieved by integrating all layers of data simultaneously. Two feature networks were constructed, related to paclitaxel response and survival, respectively. The former network involved a total of 98 modules constituted by 2,033 features from 5 data types. Among them, the expression of ABCB1, which encodes a multidrug transporter, was the most relevant factor for drug resistance and was expressed differentially among several cancer types. In addition, we identified the gene set "MICROTUBULE POLYMERIZATION OR DEPOLYMERIZATION", which influences the assembly or disassembly of microtubules, suggesting that paclitaxel affects the functions of microtubules as well as cell movement. In the second network, we identified a total of 266 features that jointly constructed 61 modules correlated with the risk of colon cancer. The mutation status of sulfatase-modifying factor 1 (SUMF1) and of the potassium channel member KCNK5 were the top two most influential factors. Moreover, the loss of chromosome 1p and hypermethylation of multiple CpG loci on chromosome 7, including sites in HOXA13, were identified as associated with poor prognosis. It is expected that the results obtained here will promote the understanding of drug resistance mechanisms and of tumor development and progression. In summary, we developed an effective and robust hybrid-analysis method to investigate multi-omics networks with implications for drug response and cancer prognosis. Its performance was corroborated using a simulation study and two real datasets. Our model is widely applicable to other omics data and is anticipated to facilitate the exploration of highly heterogeneous cancers.
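On fully simulated data, the two core ingredients described above - Lasso-based selection of outcome-associated features pooled across layers, followed by a correlation network over the selected features - might look roughly like this; the layer sizes, threshold and outcome model are arbitrary choices for the sketch, not the thesis's actual settings.

```python
import numpy as np
import networkx as nx
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 120
expr = rng.normal(size=(n, 50))     # toy "expression" layer
meth = rng.normal(size=(n, 30))     # toy "methylation" layer
X = StandardScaler().fit_transform(np.hstack([expr, meth]))
y = X[:, 3] - 0.5 * X[:, 60] + rng.normal(scale=0.5, size=n)   # outcome driven by two features

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)          # indices of retained features

# Correlation network among the selected features (edges above a threshold)
G = nx.Graph()
G.add_nodes_from(int(i) for i in selected)
if len(selected) > 1:
    corr = np.corrcoef(X[:, selected], rowvar=False)
    for a in range(len(selected)):
        for b in range(a + 1, len(selected)):
            if abs(corr[a, b]) > 0.3:
                G.add_edge(int(selected[a]), int(selected[b]), weight=float(corr[a, b]))
print(sorted(G.nodes()), G.number_of_edges())
```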
APA, Harvard, Vancouver, ISO, and other styles

To the bibliography