Dissertations / Theses: 'Biological data'

1

Rundqvist, David. "Grouping Biological Data." Thesis, Linköping University, Department of Computer and Information Science, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-6327.

Abstract:

Today, scientists in various biomedical fields rely on biological data sources in their research. Large amounts of information concerning, for instance, genes, proteins and diseases are publicly available on the internet, and are used daily for acquiring knowledge. Typically, biological data is spread across multiple sources, which has led to heterogeneity and redundancy.

The current thesis suggests grouping as one way of computationally managing biological data. A conceptual model for this purpose is presented, which takes properties specific to biological data into account. The model defines sub-tasks and key issues where multiple solutions are possible, and describes which approaches to these have been used in earlier work. Further, an implementation of this model is described, as well as test cases which show that the model is indeed useful.

Since the use of ontologies is relatively new in the management of biological data, the main focus of the thesis is on how semantic similarity of ontological annotations can be used for grouping. The results of the test cases show for example that the implementation of the model, using Gene Ontology, is capable of producing groups of data entries with similar molecular functions.
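
As a rough illustration of similarity-based grouping of annotated entries, the sketch below groups entries whose Gene Ontology term sets overlap strongly. It is not the thesis's model: Jaccard overlap stands in for a proper semantic similarity measure, the threshold and the single-linkage grouping rule are assumptions, and the entry names and the function `group_entries` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two annotation sets (a simple stand-in for semantic similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_entries(annotations, threshold):
    """Group entries that are linked by a chain of pairwise similarities >= threshold."""
    ids = sorted(annotations)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(ids, 2):
        if jaccard(annotations[a], annotations[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical entries annotated with illustrative GO molecular-function term IDs.
example = {
    "protA": {"GO:0003677", "GO:0003700"},
    "protB": {"GO:0003677", "GO:0003700", "GO:0005524"},
    "protC": {"GO:0016301", "GO:0004672", "GO:0005524"},
    "protD": {"GO:0016301", "GO:0004672"},
}

print(group_entries(example, threshold=0.5))
# [['protA', 'protB'], ['protC', 'protD']]
```

A measure that uses the ontology structure (for example, one based on shared ancestor terms) could replace `jaccard` without changing the grouping step.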

2

Hasegawa, Takanori. "Reconstructing Biological Systems Incorporating Multi-Source Biological Data via Data Assimilation Techniques." 京都大学 (Kyoto University), 2015. http://hdl.handle.net/2433/195985.

3

Jakonienė, Vaida. "Integration of biological data /." Linköping : Linköpings universitet, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-7484.

4

Jakonienė, Vaida. "Integration of Biological Data." Doctoral thesis, Linköpings universitet, IISLAB - Laboratoriet för intelligenta informationssystem, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-7484.

Abstract:

Data integration is an important procedure underlying many research tasks in the life sciences, as often multiple data sources have to be accessed to collect the relevant data. The data sources vary in content, data format, and access methods, which often vastly complicates the data retrieval process. As a result, the task of retrieving data requires a great deal of effort and expertise on the part of the user. To alleviate these difficulties, various information integration systems have been proposed in the area. However, a number of issues remain unsolved and new integration solutions are needed. The work presented in this thesis considers data integration at three different levels. 1) Integration of biological data sources deals with integrating multiple data sources from an information integration system point of view. We study properties of biological data sources and existing integration systems. Based on the study, we formulate requirements for systems integrating biological data sources. Then, we define a query language that supports queries commonly used by biologists. Also, we propose a high-level architecture for an information integration system that meets a selected set of requirements and that supports the specified query language. 2) Integration of ontologies deals with finding overlapping information between ontologies. We develop and evaluate algorithms that use life science literature and take the structure of the ontologies into account. 3) Grouping of biological data entries deals with organizing data entries into groups based on the computation of similarity values between the data entries. We propose a method that covers the main steps and components involved in similarity-based grouping procedures. The applicability of the method is illustrated by a number of test cases. Further, we develop an environment that supports comparison and evaluation of different grouping strategies. 
The work is supported by the implementation of: 1) a prototype for a system integrating biological data sources, called BioTRIFU, 2) algorithms for ontology alignment, and 3) an environment for evaluating strategies for similarity-based grouping of biological data, called KitEGA.

5

Dost, Banu. "Optimization algorithms for biological data." Diss., [La Jolla] : University of California, San Diego, 2010. http://wwwlib.umi.com/cr/ucsd/fullcit?p3397170.

Abstract:

Thesis (Ph. D.)--University of California, San Diego, 2010.
Title from first page of PDF file (viewed March 23, 2010). Available via ProQuest Digital Dissertations. Vita. Includes bibliographical references (p. 149-159).

6

Schmidberger, Markus. "Parallel Computing for Biological Data." Diss., lmu, 2009. http://nbn-resolving.de/urn:nbn:de:bvb:19-104921.

7

BERNARDINI, GIULIA. "COMBINATORIAL METHODS FOR BIOLOGICAL DATA." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2021. http://hdl.handle.net/10281/305220.

Abstract:

The main goal of this thesis is to develop new algorithmic frameworks to deal with (i) a convenient representation of a set of similar genomes and (ii) phylogenetic data, with particular attention to the increasingly accurate tumor phylogenies. A “pan-genome” is, in general, any collection of genomic sequences to be analyzed jointly or to be used as a reference for a population. A phylogeny, in turn, is meant to describe the evolutionary relationships among a group of items, be they species of living beings, genes, natural languages, ancient manuscripts or cancer cells. With the exception of one of the results included in this thesis, related to the analysis of tumor phylogenies, the focus of the whole work is mainly theoretical, the intent being to lay firm algorithmic foundations for the problems by investigating their combinatorial aspects, rather than to provide practical tools for attacking them. Deep theoretical insights on the problems allow a rigorous analysis of existing methods, identifying their strong and weak points, providing details on how they perform and helping to decide which problems need to be further addressed. In addition, it is often the case where new theoretical results (algorithms, data structures and reductions to other well-studied problems) can either be directly applied or adapted to fit the model of a practical problem, or at least they serve as inspiration for developing new practical tools. The first part of this thesis is devoted to methods for handling an elastic-degenerate text, a computational object that compactly encodes a collection of similar texts, like a pan-genome. Specifically, we attack the problem of matching a sequence in an elastic-degenerate text, both exactly and allowing a certain amount of errors, and the problem of comparing two degenerate texts. 
In the second part we consider both tumor phylogenies, describing the evolution of a tumor, and “classical” phylogenies, representing, for instance, the evolutionary history of the living beings. In particular, we present new techniques to compare two or more tumor phylogenies, needed to evaluate the results of different inference methods, and we give a new, efficient solution to a longstanding problem on “classical” phylogenies: to decide whether, in the presence of missing data, it is possible to arrange a set of species in a phylogenetic tree that enjoys specific properties.
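
To make the notion of an elastic-degenerate text concrete, here is a naive sketch of exact pattern matching over one. It is not the algorithm developed in the thesis: the representation (a list of segments, each a list of alternative strings, possibly including the empty string) and the function name `ed_match` are assumptions chosen for illustration, and no attempt is made at efficiency.

```python
def ed_match(segments, pattern):
    """Return indices of segments in which some occurrence of `pattern` ends.

    `segments` is a list of lists of strings; one string is chosen from each
    segment and the choices are concatenated to spell one encoded text.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    ends = []
    # Lengths j (0 < j < m) such that pattern[:j] matches a suffix of some
    # concatenation ending exactly at the previous segment boundary.
    active = set()

    for index, segment in enumerate(segments):
        new_active = set()
        found = False
        for s in segment:
            # Case 1: the whole pattern lies inside a single alternative.
            if pattern in s:
                found = True
            # Case 2: extend partial matches carried over from earlier segments.
            for j in active:
                rest = pattern[j:]
                if s == "":
                    new_active.add(j)            # empty alternative: carry over
                elif len(s) >= len(rest):
                    if s.startswith(rest):
                        found = True             # the match completes inside s
                elif rest.startswith(s):
                    new_active.add(j + len(s))   # s is consumed entirely
            # Case 3: start a new partial match at a suffix of s.
            for k in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:k]):
                    new_active.add(k)
        if found:
            ends.append(index)
        active = new_active
    return ends


# A small text encoding the strings "ACGT", "AACGT" and "AGT".
text = [["A"], ["C", "AC", ""], ["GT"]]
print(ed_match(text, "AGT"))  # [2]
print(ed_match(text, "AAC"))  # [1]
print(ed_match(text, "CC"))   # []
```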

8

Chakraborty, Ushashi. "Finding the Most Predictive Data Source in Biological Data." Thesis, North Dakota State University, 2013. https://hdl.handle.net/10365/26567.

Abstract:

Classification can be used to predict unknown functions of proteins by using known function information. In some cases, multiple sets of data are available for classification where prediction is only part of the problem, and knowing the most reliable source for prediction is also relevant. Our goal is to develop classification techniques to find the most predictive of the multiple data sets that we have in this project. We use existing classification techniques like linear and quadratic classification and statistical relevance measures like posterior and log p analysis in our proposed algorithm, which is able to find the data set that is expected to give the best prediction. The proposed algorithm is applied to experimental readings taken during the yeast cell cycle, and it predicts the genes that participate in cell-cycle regulation and the type of experiment that provides evidence of cell-cycle involvement for any particular gene.

9

Gel, Moreno Bernat. "Dissemination and visualisation of biological data." Doctoral thesis, Universitat Politècnica de Catalunya, 2014. http://hdl.handle.net/10803/283143.

Abstract:

With the recent advent of various waves of technological advances, the amount of biological data being generated has exploded. As a consequence of this data deluge, new challenges have emerged in the field of biological data management. In order to maximize the knowledge extracted from the huge amount of biological data produced, it is of great importance for the research community that data dissemination and visualisation challenges are tackled. Opening and sharing our data and working collaboratively will benefit the scientific community as a whole, and to move towards that end, new developments, tools and techniques are needed. Nowadays, many small research groups are capable of producing important and interesting datasets. The release of those datasets can greatly increase their scientific value. In addition, the development of new data analysis algorithms greatly benefits from the availability of a big corpus of annotated datasets for training and testing purposes, giving new and better algorithms to biomedical sciences in return. None of this would be feasible without large amounts of biological data made freely and publicly available. Dissemination: The Distributed Annotation System (DAS) is a protocol designed to publish and integrate annotations on biological entities in a distributed way. DAS is structured as a client-server system where the client retrieves data from one or more servers to further process and visualise it. Nowadays, setting up a DAS server imposes some requirements not met by many research groups. With the aim of removing the hassle of setting up a DAS server, a new software platform has been developed: easyDAS. easyDAS is a hosted platform to automatically create DAS servers. Using a simple web interface, the user can upload a data file and describe its contents, and a new DAS server will be automatically created and the data made publicly available to DAS clients.
Visualisation: One of the most broadly used visualisation paradigms for genomic data is the genome browser. A genome browser is capable of displaying different sets of features positioned relative to a sequence. It is possible to explore the sequence and the features by moving around and zooming in and out. When this project was started, in 2007, all major genome browsers offered quite a static experience. It was possible to browse and explore data, but this was done through a set of buttons that moved along the genome a certain number of bases to the left or right or zoomed in and out. From an architectural point of view, all web-based genome browsers were very similar: they all had a relatively thin client-side part in charge of showing images and big backend servers taking care of everything else. Every change in the display parameters made by the user triggered a request to the server, impacting the perceived responsiveness. We created a new prototype genome browser called GenExp, an interactive web-based browser with canvas-based client-side data rendering. It offers fluid direct interaction with the genome representation: it is possible to drag it with the mouse and to use the mouse wheel to change the zoom level. GenExp also offers some quite unique features, such as its multi-window capabilities, which allow a user to create an arbitrary number of independent or linked genome windows, and its ability to save and share browsing sessions. GenExp is a DAS client and all data is retrieved from DAS sources. It is possible to add any available DAS data source, including all data in Ensembl and UCSC and even custom sources created with easyDAS. In addition, we developed a JavaScript DAS client library, jsDAS. jsDAS is a complete DAS client library that takes care of everything DAS-related in a JavaScript application. jsDAS is JavaScript-library agnostic and can be used to add DAS capabilities to any web application.
All software developed in this thesis is freely available under an open source license.

10

Droop, Alastair Philip. "Correlation Analysis of Multivariate Biological Data." Thesis, University of York, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.507622.

11

McCormick, Paul Stephen. "Statistical analysis of biological expression data." Thesis, University of Cambridge, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.613819.

12

Bokhari, Yahya. "DISCOVERING DRIVER MUTATIONS IN BIOLOGICAL DATA." VCU Scholars Compass, 2018. https://scholarscompass.vcu.edu/etd/5637.

Abstract:

Background: Somatic mutations accumulate in human cells throughout life. Some may have no adverse consequences, but some of them may lead to cancer. A cancer genome is typically unstable, and thus more mutations can accumulate in the DNA of cancer cells. An ongoing problem is to figure out which mutations are drivers, which play a role in oncogenesis, and which are passengers, which do not. One way of addressing this question is through inspection of somatic mutations in DNA of cancer samples from a cohort of patients and detection of patterns that differentiate driver from passenger mutations. Results: We propose QuaDMutEx and QuaDMutNetEx. QuaDMutEx is a method that incorporates three novel elements: a new gene set penalty that includes non-linear penalization of multiple mutations in putative sets of driver genes, an ability to adjust the method to handle slow- and fast-evolving tumors, and a computationally efficient method for finding gene sets that minimize the penalty, through a combination of heuristic Monte Carlo optimization and exact binary quadratic programming. QuaDMutNetEx is our proposed method that combines protein-protein interaction networks with the method elements of QuaDMutEx. In the new method, we incorporated a new quadratic rewarding term that prefers gene solution sets that are connected with respect to protein-protein interaction networks. Compared to existing methods, the proposed algorithms find sets of putative driver genes that show higher coverage and lower excess coverage in eight sets of cancer samples coming from brain, ovarian, lung, and breast tumors.
Conclusions: Superior ability to improve on both coverage and excess coverage on different types of cancer shows that QuaDMutEx and QuaDMutNetEx are tools that should be part of a state-of-the-art toolbox in the driver gene discovery pipeline. They can detect genes harboring rare driver mutations that may be missed by existing methods.
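
The two evaluation quantities named in the abstract can be computed as follows. This is a sketch under an assumed, commonly used definition: coverage counts patients with at least one mutated gene in the candidate set, and excess coverage counts every additional hit beyond the first per patient. The abstract does not define the terms, the thesis's actual penalty is not reproduced, and the patient labels and the function name `coverage_stats` are hypothetical.

```python
def coverage_stats(mutations, gene_set):
    """Return (coverage, excess_coverage) of `gene_set` over a patient cohort.

    `mutations` maps each patient to the set of genes mutated in that patient.
    """
    coverage = 0
    excess = 0
    for genes in mutations.values():
        hits = len(genes & gene_set)
        if hits:
            coverage += 1       # the patient is explained by the gene set
            excess += hits - 1  # hits beyond the first break mutual exclusivity
    return coverage, excess


# Hypothetical cohort of four patients.
cohort = {
    "patient1": {"TP53"},
    "patient2": {"TP53", "EGFR"},
    "patient3": {"PTEN"},
    "patient4": {"KRAS"},
}

print(coverage_stats(cohort, {"TP53", "EGFR", "PTEN"}))  # (3, 1)
```

Under this definition, a good candidate driver set covers many patients while keeping the excess close to zero.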

13

Luo, Jun. "Mining algorithms for generic and biological data." [Gainesville, Fla.]: University of Florida, 2002. http://purl.fcla.edu/fcla/etd/UFE0000567.

14

Zou, Cunlu. "Applications of Granger causality to biological data." Thesis, University of Warwick, 2010. http://wrap.warwick.ac.uk/35694/.

Abstract:

In computational biology, one often faces the problem of deriving the causal relationship among different elements such as genes, proteins, metabolites, neurons and so on, based upon multi-dimensional temporal data. In the literature, there are several well-established reverse-engineering approaches to explore causal relationships in a dynamic network, such as ordinary differential equations (ODE), Bayesian networks, information theory and Granger causality. When these four approaches are applied to the same problem, a key issue is choosing which one to use, in particular when they give rise to contradictory results. In this thesis, I provided an answer by focusing on a systematic and computationally intensive comparison between two common approaches: dynamic Bayesian network inference and Granger causality. The comparison was carried out on both synthesized and experimental data. It is concluded that dynamic Bayesian network inference performs better than the Granger causality approach when the data length is short; otherwise the Granger causality approach is better. Since the Granger causality approach is able to detect weak interactions when the time series are long enough, I then focused on applying the Granger causality approach to real experimental data, both in the time and frequency domain and in local and global networks. For a small gene network, Granger causality outperformed all the other three approaches mentioned above. A global protein network of 812 proteins was reconstructed using a novel approach. The obtained results fitted well with known experimental findings and predicted many experimentally testable results. In addition to interactions in the time domain, interactions in the frequency domain were also recovered. In addition to gene and protein data, the Granger causality approach was also applied to Local Field Potential (LFP) data.
Here we have combined multiarray electrophysiological recordings of local field potentials in both right inferior temporal (rIT) and left IT (lIT) and right anterior cingulate (rAC) cortices in sheep with Granger causality to investigate how anaesthesia alters processing during resting state and exposure to pictures of faces. Results from both the time and frequency domain analyses show that loss of consciousness during anaesthesia is associated with a reduction/disruption of feed forward open-loop cortico-cortical connections and a corresponding increase in shorter-distance closed loop ones.
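
The idea behind time-domain Granger causality can be sketched with a one-lag, two-variable example: a source series "Granger-causes" a target if adding the source's past to an autoregressive model of the target reduces the residual error. The code below is only an illustration on synthetic data, not the thesis's procedure; the lag order of 1, the log-ratio statistic, and all function names are assumptions.

```python
import math
import random


def _solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def _rss(rows, targets):
    """Residual sum of squares of the least-squares fit targets ~ rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = _solve(xtx, xty)
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_lag1(source, target):
    """Log ratio of residual errors without and with the source's previous value."""
    y = target[1:]
    steps = range(1, len(target))
    restricted = [[1.0, target[t - 1]] for t in steps]
    full = [[1.0, target[t - 1], source[t - 1]] for t in steps]
    return math.log(_rss(restricted, y) / _rss(full, y))


# Synthetic data: y follows x with a delay of one step, plus a little noise.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0.0, 1.0) for t in range(1, 300)]

print("x -> y:", granger_lag1(x, y))  # large: the past of x helps predict y
print("y -> x:", granger_lag1(y, x))  # near zero: the past of y does not help
```

In practice the lag order would be selected from the data and the statistic tested for significance rather than read off directly.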

15

Waterworth, Alan Richard. "Data analysis techniques of measured biological impedance." Thesis, University of Sheffield, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.340146.

16

Styczynski, Mark Philip-Walter. "Applications of motif discovery in biological data." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/38976.

Abstract:

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Chemical Engineering, 2007.
Includes bibliographical references (p. 437-458).
Sequential motif discovery, the ability to identify conserved patterns in ordered datasets without a priori knowledge of exactly what those patterns will be, is a frequently encountered and difficult problem in computational biology and biochemical engineering. The most prevalent example of such a problem is finding conserved DNA sequences in the upstream regions of genes that are believed to be coregulated. Other examples are as diverse as identifying conserved secondary structure in proteins and interpreting time-series data. This thesis creates a unified, generic approach to addressing these (and other) problems in sequential motif discovery and demonstrates the utility of that approach on a number of applications. A generic motif discovery algorithm was created for the purpose of finding conserved patterns in arbitrary data types. This approach and implementation, named Gemoda, decouples three key steps in the motif discovery process: comparison, clustering, and convolution. Since it decouples these steps, Gemoda is a modular algorithm; that is, any comparison metric can be used with any clustering algorithm and any convolution scheme. The comparison metric is a data-specific function that transforms the motif discovery problem into a solvable graph-theoretic problem that still adequately represents the important similarities in the data.
This thesis presents the development of Gemoda as well as applications of this approach in a number of different contexts. One application is an exhaustive solution of an abstraction of the transcription factor binding site discovery problem in DNA. A similar application is to the analysis of upstream regions of regulons in microbial DNA. Another application is the identification of protein sequence homologies in a set of related proteins in the presence of significant noise. A quite different application is the discovery of extended local secondary structure homology between a protein and a protein complex known to be in the same structural family. The final application is to the analysis of metabolomic datasets. The diversity of these sample applications, which range from the analysis of strings (like DNA and amino acid sequences) to real-valued data (like protein structures and metabolomic datasets), demonstrates that our generic approach is successful and useful for solving established and novel problems alike. The last application, of analyzing metabolomic datasets, is of particular interest. Using Gemoda, an appropriate comparison function, and appropriate data handling, a novel and useful approach to the interpretation of metabolite profiling datasets obtained from gas chromatography coupled to mass spectrometry is developed.
The use of a motif discovery approach allows for the expansion of the scope of metabolites that can be tracked and analyzed in an untargeted metabolite profiling (or metabolomic) experiment. This new approach, named SpectConnect, is presented herein along with examples that verify its efficacy and utility in some validation experiments. The beginning of a broader application of SpectConnect's potential is presented as well. The success of SpectConnect, a novel application of Gemoda, validates the utility of a truly generic approach to motif discovery. By not getting bogged down in the specifics of a type of data and a problem unique to that type of data, a broader class of problems can be addressed that otherwise would have been extremely difficult to handle.
by Mark Philip-Walter Styczynski.
Ph.D.

17

Scelfo, Tony (Tony W. ). "Data visualization of biological microscopy image analyses." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/37073.

Abstract:

Thesis (M. Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.
Includes bibliographical references.
The Open Microscopy Environment (OME) provides biologists with a framework to store, analyze and manipulate large sets of image data. Current microscopes are capable of generating large numbers of images, and when these are coupled with automated analysis routines, researchers are able to generate intractable sets of data. I have developed an extension to the OME toolkit, named the LoViewer, which allows researchers to quickly identify clusters of images based on relationships between analytically measured parameters. By identifying unique subsets of data, researchers are able to make use of the rest of the OME client software to view interesting images in high resolution, classify them into category groups and apply further analysis routines. The design of the LoViewer itself and its integration with the rest of the OME toolkit will be discussed in detail in the body of this thesis.
by Tony Scelfo.
M.Eng. and S.B.

18

Becker, Katinka [Verfasser]. "Logical Analysis of Biological Data / Katinka Becker." Berlin : Freie Universität Berlin, 2021. http://d-nb.info/1241541779/34.

19

Slotta, Douglas J. "Evaluating Biological Data Using Rank Correlation Methods." Diss., Virginia Tech, 2005. http://hdl.handle.net/10919/27613.

Abstract:

Analyses based upon rank correlation methods, such as Spearman's Rho and Kendall's Tau, can provide quick insights into large biological data sets. Comparing expression levels between different technologies and models is problematic due to the different units of measure. Here again, rank correlation provides an effective means of comparison between the two techniques. Massively Parallel Signature Sequencing (MPSS) transcript abundance levels are compared to microarray signal intensities for Arabidopsis thaliana. Rank correlations can be applied to subsets as well as the entire set. Results of subset comparisons can be used to improve the capabilities of predictive models, such as Predicted Highly Expressed (PHX). This is done for Escherichia coli. Methods are given to combine predictive models based upon feedback from experimental data. The problem of feature selection in supervised learning situations is also considered, where all features are drawn from a common domain and are best interpreted via ordinal comparisons with other features, rather than as numerical values. This is done for synthetic data as well as for microarray experiments examining the life cycle of Drosophila melanogaster and human leukemia cells. Two novel methods are presented based upon Rho and Tau, and their efficacy is tested with synthetic and real-world data. The method based upon Spearman's Rho is shown to be more effective.
Ph. D.
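
As an illustration of the two statistics named in the abstract above, the following is a minimal sketch of Spearman's Rho and Kendall's Tau in plain Python. It is not taken from the dissertation: the function names, the choice of the tau-a variant, and the two short measurement lists are assumptions made only for this example.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Tau-a: (concordant - discordant) / all pairs; ties count as neither."""
    score = 0
    pairs = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)
        pairs += 1
    return score / pairs


# Hypothetical values on two incompatible scales (counts vs. intensities).
mpss_counts = [12, 250, 40, 3, 900]
array_signal = [0.8, 5.1, 2.2, 0.3, 4.0]
rho = spearman_rho(mpss_counts, array_signal)
tau = kendall_tau_a(mpss_counts, array_signal)
```

Because both statistics depend only on the ordering of the values, the differing units of the two lists do not matter, which is the property the abstract relies on.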

APA, Harvard, Vancouver, ISO, and other styles

20

Anderson, Sarah G. "Statistical Methods for Biological and Relational Data." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1365441350.

Full text

APA, Harvard, Vancouver, ISO, and other styles

21

Eren, Kemal. "Application of biclustering algorithms to biological data." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1332533492.

Full text

APA, Harvard, Vancouver, ISO, and other styles

22

Mahammad, Beigi Majid. "Kernel methods for high-dimensional biological data." [S.l. : s.n.], 2008.

Find full text

APA, Harvard, Vancouver, ISO, and other styles

23

REHMAN, HAFEEZ UR. "Integration and Analysis of Heterogeneous Biological Data." Doctoral thesis, Politecnico di Torino, 2014. http://hdl.handle.net/11583/2537092.

Full text

Abstract:

We live in the era of networks. The power of networks is the most fundamental driving force behind the machinery of life. Living bodies stay alive through complex inter-regulations of biochemical networks, and information flows through these networks with such great intensity and complexity that it exceeds anything that human ingenuity has been able to produce so far. Due to this overwhelming complexity, we have begun to see a rapid rise in studies aimed at explaining the fundamental concepts and hidden properties of such complex systems. This thesis provides a strong foundation for using networks to understand complex biological phenomena such as protein functions, as well as a more accurate method of modeling gene regulatory networks. In the first part we presented a methodology that uses existing biological data with gene ontology functional dependencies to infer functions of uncharacterized proteins. We combined different sources of structural and functional information along with gene ontology based term-specific relationships to predict precise functions of unannotated proteins. Such term-specific relationships are defined to clearly identify the functional contexts of each activity among the interacting proteins, which enables a dramatic improvement of the annotation accuracy with respect to previous approaches. The presented methodology may be easily extended to integrate more sources of biological information to further improve the function prediction confidence. In the second part of this thesis we discussed an extended BN model to account for post-transcriptional regulation in GRN simulation. Thanks to this extended model, we discussed the set of attractors of two biologically confirmed networks, focusing on the regulatory role of miR-7. Attractors have been compared with networks in which the miRNA was removed. 
The central role of the miRNA in increasing the network stability has been highlighted in both networks, confirming the cooperative stabilizing role of miR-7. The enhanced BN model presented in this thesis is only a first step towards a more realistic analysis of the high-level functional and topological characteristics of GRNs. Resorting to the tool facilities, the dynamics of real networks can be analyzed. Thanks to the extended model that includes post-transcriptional regulations, not only can the network simulation be more reliable, but it can also offer new insights on the role of miRNAs from a functional perspective, and this improves the current state-of-the-art, which mostly focuses on high-level gene/gene or gene/protein interactions, neglecting post-transcriptional regulations. Due to its discrete nature, the BN model may still neglect some regulatory fine adjustments. However, the largest number of the computed attractors, now including miRNAs, still represents meaningful states of the network. The simple glimpse into the complexity of the network dynamics that the toolkit is able to provide could be used not only as a validation of in vitro experiments, but as a real Systems Biology tool able to raise new questions and drive new experiments.
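
The comparison of attractors with and without a miRNA can be illustrated with a toy example. The sketch below reads "BN" as a Boolean network with synchronous updates, which is an assumption, and it does not reproduce the thesis's extended model or its toolkit. The two update rules and all names are hypothetical and chosen only to show how attractor sets are enumerated and compared.

```python
from itertools import product


def find_attractors(update, n_nodes):
    """Enumerate attractors of a synchronous Boolean network exhaustively.

    `update` maps a state tuple of 0/1 values to the next state tuple.
    Each attractor is returned as a tuple of states in canonical rotation.
    """
    attractors = set()
    for state in product((0, 1), repeat=n_nodes):
        seen = {}
        trajectory = []
        # Follow the trajectory until a state repeats.
        while state not in seen:
            seen[state] = len(trajectory)
            trajectory.append(state)
            state = update(state)
        cycle = trajectory[seen[state]:]
        # Rotate the cycle to start at its smallest state so that the same
        # attractor reached from different start states compares equal.
        k = cycle.index(min(cycle))
        attractors.add(tuple(cycle[k:] + cycle[:k]))
    return attractors


def without_mirna(state):
    """Hypothetical two-gene network: each gene copies the other."""
    a, b = state
    return (b, a)


def with_mirna(state):
    """Same network plus a hypothetical miRNA m, induced by a, that blocks a."""
    a, b, m = state
    return (int(b and not m), a, a)


attractors_without = find_attractors(without_mirna, 2)
attractors_with = find_attractors(with_mirna, 3)
```

In this toy case the network without the miRNA has two fixed points and one two-state cycle, while the version with the miRNA collapses to a single fixed point. That outcome is a property of the invented rules and says nothing about the networks studied in the thesis.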

APA, Harvard, Vancouver, ISO, and other styles

24

Li, Honghao. "Interpretable biological network reconstruction from observational data." Electronic Thesis or Diss., Université Paris Cité, 2021. http://www.theses.fr/2021UNIP5207.

Full text

Abstract:

Cette thèse porte sur les méthodes basées sur des contraintes. Nous présentons comme exemple l’algorithme PC, pour lequel nous proposons une modification qui garantit la cohérence des ensembles de séparation, utilisés pendant l’étape de reconstruction du squelette pour supprimer les arêtes entre les variables conditionnellement indépendantes, par rapport au graphe final. Elle consiste à itérer l’algorithme d’apprentissage de structure tout en limitant la recherche des ensembles de séparation à ceux qui sont cohérents par rapport au graphe obtenu à la fin de l’itération précédente. La contrainte peut être posée avec une complexité de calcul limitée à l’aide de la décomposition en block-cut tree du squelette du graphe. La modification permet d’augmenter le rappel au prix de la précision des méthodes basées sur des contraintes, tout en conservant une performance globale similaire ou supérieure. Elle améliore également l’interprétabilité et l’explicabilité du modèle graphique obtenu. Nous présentons ensuite la méthode basée sur des contraintes MIIC, récemment développée, qui adopte les idées du cadre du maximum de vraisemblance pour améliorer la robustesse et la performance du graphe obtenu. Nous discutons les caractéristiques et les limites de MIIC, et proposons plusieurs modifications qui mettent l’accent sur l’interprétabilité du graphe obtenu et l’extensibilité de l’algorithme. En particulier, nous mettons en œuvre l’approche itérative pour renforcer la cohérence de l’ensemble de séparation, nous optons pour une règle d’orientation conservatrice et nous utilisons la probabilité d’orientation de MIIC pour étendre la notation des arêtes dans le graphe final afin d’illustrer différentes relations causales. L’algorithme MIIC est appliqué à un ensemble de données d’environ 400 000 dossiers de cancer du sein provenant de la base de données SEER, comme benchmark à grande échelle dans la vie réelle
This thesis is focused on constraint-based methods, one of the basic types of causal structure learning algorithms. We use the PC algorithm as a representative, for which we propose a simple and general modification that is applicable to any PC-derived method. The modification ensures that all separating sets used during the skeleton reconstruction step to remove edges between conditionally independent variables remain consistent with respect to the final graph. It consists in iterating the structure learning algorithm while restricting the search of separating sets to those that are consistent with respect to the graph obtained at the end of the previous iteration. The restriction can be achieved with limited computational complexity with the help of block-cut tree decomposition of the graph skeleton. The enforcement of separating set consistency is found to increase the recall of constraint-based methods at the cost of precision, while keeping similar or better overall performance. It also improves the interpretability and explainability of the obtained graphical model. We then introduce the recently developed constraint-based method MIIC, which adopts ideas from the maximum likelihood framework to improve the robustness and overall performance of the obtained graph. We discuss the characteristics and the limitations of MIIC, and propose several modifications that emphasize the interpretability of the obtained graph and the scalability of the algorithm. In particular, we implement the iterative approach to enforce separating set consistency, opt for a conservative rule of orientation, and exploit the orientation probability feature of MIIC to extend the edge notation in the final graph to illustrate different causal implications. The MIIC algorithm is applied to a dataset of about 400 000 breast cancer records from the SEER database, as a large-scale real-life benchmark.

APA, Harvard, Vancouver, ISO, and other styles

25

Pustułka-Hunt, Elżbieta Katarzyna. "Biological sequence indexing using persistent Java." Thesis, University of Glasgow, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.270957.

Full text

APA, Harvard, Vancouver, ISO, and other styles

26

Li, Yehua. "Topics in functional data analysis with biological applications." [College Station, Tex. : Texas A&M University, 2006. http://hdl.handle.net/1969.1/ETD-TAMU-1867.

Full text

APA, Harvard, Vancouver, ISO, and other styles

27

Scholz, Matthias. "Approaches to analyse and interpret biological profile data." Phd thesis, [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=980988799.

Full text

APA, Harvard, Vancouver, ISO, and other styles

28

Mathur, Sachin Dinakarpandian Deendayal. "Assessing biological significance of clusters of microarray data." Diss., UMK access, 2004.

Find full text

Abstract:

Thesis (M.S.)--School of Computing and Engineering. University of Missouri--Kansas City, 2004.
"A thesis in computer science." Typescript. Advisor: Deendayal Dinakarpandian. Vita. Title from "catalog record" of the print edition Description based on contents viewed Feb. 27, 2006. Includes bibliographical references (leaves 35-36). Online version of the print edition.

APA, Harvard, Vancouver, ISO, and other styles

29

Iacucci, Ernesto. "Ontological characterization of high through-put biological data." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=84102.

Full text

Abstract:

A result of high-throughput experimentation is the demand to summarize and profile results in a meaningful and comparative form. Such experimentation often includes the production of a set of distinguished genes. For example, this distinguished set may correspond to a cluster of co-expressed genes over many conditions or a set of genes from a large scale yeast two-hybrid study. Understanding the biological relevance of this set will encompass annotation of the genes followed by investigation of shared properties found among these annotations. While the set of distinguished genes might have hundreds of annotations associated with them, only a portion of these annotations will represent meaningful aspects associated with the experiment. Identification of the meaningful aspects can be focused by application of a statistic to an annotation resource. One such annotation resource is Gene Ontology (GO), a controlled vocabulary which hierarchically structures annotation terms (classifications) onto which genes can be mapped. Given a distinguished set of genes and a classification, we wish to determine if the number of distinguished genes mapped to that classification is significantly greater or less than would be expected by chance. In estimating these probabilities, researchers have employed the hypergeometric model under differing frameworks. Assumptions made in these frameworks have ignored key issues regarding the mapping of genes to GO and have resulted in inaccurate p-values. Here we show how dynamic programming can be used to compute exact p-values for enrichment or depletion of a particular GO classification. This removes the necessity of approximating the statistics or p-values, as has been the common practice. We apply our methods to a dataset describing labour and compare p-values based on exact and approximate computations of several different statistics for measuring enrichment. We find significant disagreement between commonly employ
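
For orientation, the plain hypergeometric model that the abstract describes as the common practice can be sketched as follows. This is not the exact dynamic-programming method proposed in the thesis. It treats a single classification in isolation, and the function and parameter names are assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(k, total, annotated, drawn):
    """P(X = k): k of the `drawn` genes fall among the `annotated` genes,
    when `drawn` genes are sampled without replacement from `total`."""
    return comb(annotated, k) * comb(total - annotated, drawn - k) / comb(total, drawn)


def enrichment_p(k, total, annotated, drawn):
    """Upper tail P(X >= k): at least k distinguished genes map to the term."""
    upper = min(annotated, drawn)
    return sum(hypergeom_pmf(i, total, annotated, drawn) for i in range(k, upper + 1))


def depletion_p(k, total, annotated, drawn):
    """Lower tail P(X <= k): at most k distinguished genes map to the term."""
    lower = max(0, drawn - (total - annotated))
    return sum(hypergeom_pmf(i, total, annotated, drawn) for i in range(lower, k + 1))


# Toy numbers: 10 genes in total, 4 mapped to the term, 3 distinguished
# genes, of which 2 are mapped to the term.
p_enriched = enrichment_p(2, total=10, annotated=4, drawn=3)
p_depleted = depletion_p(1, total=10, annotated=4, drawn=3)
```

With these toy numbers the upper tail is (36 + 4) / 120 = 1/3 and the complementary lower tail is 2/3.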

APA, Harvard, Vancouver, ISO, and other styles

30

Anastasiadis, Aristoklis. "Neural networks training and applications using biological data." Thesis, Birkbeck (University of London), 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.428055.

Full text

APA, Harvard, Vancouver, ISO, and other styles

31

Yang, L. "Optimisation approaches for data mining in biological systems." Thesis, University College London (University of London), 2016. http://discovery.ucl.ac.uk/1473809/.

Full text

Abstract:

The advances in data acquisition technologies have generated massive amounts of data that present a considerable challenge for analysis. How to efficiently and automatically mine through the data and extract the maximum value by identifying the hidden patterns is an active research area, called data mining. This thesis tackles several problems in data mining, including data classification, regression analysis and community detection in complex networks, with considerable applications in various biological systems. First, the problem of data classification is investigated. An existing classifier has been adopted from the literature and two novel solution procedures have been proposed, which are shown to improve the predictive accuracy of the original method and significantly reduce the computational time. Disease classification using high throughput genomic data is also addressed. To tackle the problem of analysing a large number of genes against a small number of samples, a new approach of incorporating extra biological knowledge and constructing higher level composite features for classification has been proposed. A novel model has been introduced to optimise the construction of composite features. Subsequently, regression analysis is considered where two piece-wise linear regression methods have been presented. The first method partitions one feature into multiple complementary intervals and fits each with a distinct linear function. The other method is a more generalised variant of the previous one and performs recursive binary partitioning that permits partitioning of multiple features. Lastly, community detection in complex networks is investigated where a new optimisation framework is introduced to identify the modular structure hidden in directed networks via optimisation of modularity. A non-linear model is firstly proposed before its linearised variant is presented. 
The optimisation framework consists of two major steps, including solving the non-linear model to identify a coarse initial partition and a second step of solving repeatedly the linearised models to refine the network partition.
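
The quantity being optimised in the community detection step can be made concrete with a small sketch. It assumes the widely used definition of modularity for directed networks, Q = (1/m) Σ_ij [A_ij − k_i^out k_j^in / m] δ(c_i, c_j). The abstract does not state which definition the thesis adopts, and the sketch only evaluates a given partition; it does not perform the non-linear or linearised optimisation. All names are hypothetical.

```python
from collections import defaultdict


def directed_modularity(edges, community):
    """Modularity of a partition of a directed, unweighted graph.

    edges: list of (source, target) pairs.
    community: dict mapping every node to a community label.
    """
    m = len(edges)
    out_deg = defaultdict(int)   # total out-degree per community
    in_deg = defaultdict(int)    # total in-degree per community
    inside = defaultdict(int)    # edges with both ends in the community
    for u, v in edges:
        out_deg[community[u]] += 1
        in_deg[community[v]] += 1
        if community[u] == community[v]:
            inside[community[u]] += 1
    # Summing the pairwise formula over node pairs in the same community
    # gives, per community, inside/m - (out * in) / m^2.
    labels = set(community.values())
    return sum(inside[c] / m - out_deg[c] * in_deg[c] / m ** 2 for c in labels)


# Two directed 3-cycles joined by a single edge 2 -> 3.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}
q_split = directed_modularity(edges, two_groups)
q_merged = directed_modularity(edges, one_group)
```

Splitting the toy graph into its two cycles gives Q = 18/49, while placing every node in one community gives Q = 0, so the split is the better partition under this measure.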

APA, Harvard, Vancouver, ISO, and other styles

32

Shrestha, Anuj. "Association Rule Mining of Biological Field Data Sets." Thesis, North Dakota State University, 2017. https://hdl.handle.net/10365/28394.

Full text

Abstract:

Association rule mining is an important data mining technique, yet its use in association analysis of biological data sets has been limited. This mining technique was applied to two biological data sets, a genome and a damselfly data set. The raw data sets were pre-processed, and then association analysis was performed with various configurations. The pre-processing task involves minimizing the number of association attributes in genome data and creating the association attributes in damselfly data. The configurations include generation of single/maximal rules and handling single/multiple tier attributes. Both data sets have a binary class label and, using association analysis, attributes of importance to each of these class labels are found. The results (rules) from association analysis are then visualized using graph networks by incorporating the association attributes like support and confidence, differential color schemes and features from the pre-processed data.
Bioinformatics Seed Grant Program NIH/UND
National Science Foundation (NSF) Grant IIA-1355466
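
The two rule measures mentioned in the abstract, support and confidence, can be computed as in the minimal sketch below. The records and attribute names are invented for illustration and are not drawn from the genome or damselfly data sets. Each record is assumed to be a set of attribute=value items that includes the binary class label.

```python
def support(transactions, itemset):
    """Fraction of records that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= record for record in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone (0.0 if the antecedent never occurs)."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical pre-processed records.
records = [
    {"habitat=pond", "wing=long", "class=1"},
    {"habitat=pond", "wing=short", "class=1"},
    {"habitat=stream", "wing=long", "class=0"},
    {"habitat=pond", "wing=long", "class=1"},
    {"habitat=stream", "wing=short", "class=0"},
]

sup_pond_pos = support(records, {"habitat=pond", "class=1"})
conf_pond_pos = confidence(records, {"habitat=pond"}, {"class=1"})
conf_long_pos = confidence(records, {"wing=long"}, {"class=1"})
```

Here the rule from habitat=pond to class=1 has support 3/5 and confidence 1, while the rule from wing=long to class=1 has confidence 2/3, so the first attribute would rank as more important to that class label.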

APA, Harvard, Vancouver, ISO, and other styles

33

Chen, Li. "Searching for significant feature interaction from biological data." Diss., Online access via UMI:, 2007.

Find full text

APA, Harvard, Vancouver, ISO, and other styles

34

Dang, Vinh Q. "Evolutionary approaches for feature selection in biological data." Thesis, Edith Cowan University, Research Online, Perth, Western Australia, 2014. https://ro.ecu.edu.au/theses/1276.

Full text

Abstract:

Data mining techniques have been used widely in many areas such as business, science, engineering and medicine. The techniques allow a vast amount of data to be explored in order to extract useful information from the data. One of the foci in the health area is finding interesting biomarkers from biomedical data. Mass throughput data generated from microarrays and mass spectrometry from biological samples are high dimensional and small in sample size. Examples include DNA microarray datasets with up to 500,000 genes and mass spectrometry data with 300,000 m/z values. While the availability of such datasets can aid in the development of techniques/drugs to improve diagnosis and treatment of diseases, a major challenge involves their analysis to extract useful and meaningful information. The aims of this project are: 1) to investigate and develop feature selection algorithms that incorporate various evolutionary strategies, 2) to use the developed algorithms to find the “most relevant” biomarkers contained in biological datasets and 3) to evaluate the goodness of extracted feature subsets for relevance (examined in terms of existing biomedical domain knowledge and from classification accuracy obtained using different classifiers). The project aims to generate good predictive models for classifying diseased samples from control.

APA, Harvard, Vancouver, ISO, and other styles

35

Causey, Jason L. "Studying Low Complexity Structures in Bioinformatics Data Analysis of Biological and Biomedical Data." Thesis, University of Arkansas at Little Rock, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10750808.

Full text

Abstract:

Biological, biomedical, and radiological data tend to be large, complex, and noisy. Gene expression studies contain expression levels for thousands of genes and hundreds or thousands of patients. Chest Computed Tomography images used for diagnosing lung cancer consist of hundreds of 2-D image “slices”, each containing hundreds of thousands of pixels. Beneath the size and apparent complexity of many of these data are simple and sparse structures. These low complexity structures can be leveraged into new approaches to biological, biomedical, and radiological data analyses. Two examples are presented here. First, we present SparRec (Sparse Recovery), a new framework for imputation of GWAS data based on a matrix completion (MC) model that takes advantage of the low rank and low number of co-clusters of GWAS matrices. SparRec is flexible enough to impute meta-analyses with multiple cohorts genotyped on different sets of SNPs, even without a reference panel. Compared with Mendel-Impute, another MC method, our low-rank based method achieves similar accuracy and efficiency even with up to 90% missing data; our co-clustering based method has advantages in running time. MC methods are shown to have advantages over statistics-based methods, including Beagle and fastPhase. Second, we demonstrate NoduleX, a method for predicting lung nodule malignancy from chest Computed Tomography (CT) data, based on deep convolutional neural networks. For training and validation, we analyze >1000 lung nodules in images from the LIDC/IDRI cohort and compare our results with classifications provided by four experienced thoracic radiologists who participated in the LIDC project. NoduleX achieves high accuracy for nodule malignancy classification, with an AUC of up to 0.99, commensurate with the radiologists’ analysis. 
Whether they are leveraged directly or extracted using mathematical optimization and machine learning techniques, low complexity structures provide researchers with powerful tools for taming complex data.

APA, Harvard, Vancouver, ISO, and other styles

36

Flöter, André. "Analyzing biological expression data based on decision tree induction." [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=978444728.

Full text

APA, Harvard, Vancouver, ISO, and other styles

37

Flöter, André. "Analyzing biological expression data based on decision tree induction." Phd thesis, Universität Potsdam, 2005. http://opus.kobv.de/ubp/volltexte/2006/641/.

Full text

Abstract:

Modern biological analysis techniques supply scientists with various forms of data. One category of such data is the so-called "expression data". These data indicate the quantities of biochemical compounds present in tissue samples.

Recently, expression data can be generated at a high speed. This leads in turn to amounts of data no longer analysable by classical statistical techniques. Systems biology is the new field that focuses on the modelling of this information.

At present, various methods are used for this purpose. One superordinate class of these methods is machine learning. Methods of this kind had, until recently, predominantly been used for classification and prediction tasks. This neglected a powerful secondary benefit: the ability to induce interpretable models.

Obtaining such models from data has become a key issue within Systems biology. Numerous approaches have been proposed and intensively discussed. This thesis focuses on the examination and exploitation of one basic technique: decision trees.

The concept of comparing sets of decision trees is developed. This method offers the possibility of identifying significant thresholds in continuous or discrete valued attributes through their corresponding set of decision trees. Finding significant thresholds in attributes is a means of identifying states in living organisms. Knowing about states is an invaluable clue to the understanding of dynamic processes in organisms. Applied to metabolite concentration data, the proposed method was able to identify states which were not found with conventional techniques for threshold extraction.

A second approach exploits the structure of sets of decision trees for the discovery of combinatorial dependencies between attributes. Previous work on this issue has focused either on expensive computational methods or on the interpretation of single decision trees, a very limited exploitation of the data. This has led to incomplete or unstable results. That is why a new method is developed that uses sets of decision trees to overcome these limitations.

Both the introduced methods are available as software tools. They can be applied consecutively or separately. That way they make up a package of analytical tools that usefully supplement existing methods.

By means of these tools, the newly introduced methods were able to confirm existing knowledge and to suggest interesting and new relationships between metabolites.

Neuere biologische Analysetechniken liefern Forschern verschiedenste Arten von Daten. Eine Art dieser Daten sind die so genannten "Expressionsdaten". Sie geben die Konzentrationen biochemischer Inhaltsstoffe in Gewebeproben an.

Neuerdings können Expressionsdaten sehr schnell erzeugt werden. Das führt wiederum zu so großen Datenmengen, dass sie nicht mehr mit klassischen statistischen Verfahren analysiert werden können. "System biology" ist eine neue Disziplin, die sich mit der Modellierung solcher Information befasst.

Zur Zeit werden dazu verschiedenste Methoden benutzt. Eine Superklasse dieser Methoden ist das maschinelle Lernen. Dieses wurde bis vor kurzem ausschließlich zum Klassifizieren und zum Vorhersagen genutzt. Dabei wurde eine wichtige zweite Eigenschaft vernachlässigt, nämlich die Möglichkeit zum Erlernen von interpretierbaren Modellen.

Die Erstellung solcher Modelle hat mittlerweile eine Schlüsselrolle in der "Systems biology" erlangt. Es sind bereits zahlreiche Methoden dazu vorgeschlagen und diskutiert worden. Die vorliegende Arbeit befasst sich mit der Untersuchung und Nutzung einer ganz grundlegenden Technik: den Entscheidungsbäumen.

Zunächst wird ein Konzept zum Vergleich von Baummengen entwickelt, welches das Erkennen bedeutsamer Schwellwerte in reellwertigen Daten anhand ihrer zugehörigen Entscheidungswälder ermöglicht. Das Erkennen solcher Schwellwerte dient dem Verständnis von dynamischen Abläufen in lebenden Organismen. Bei der Anwendung dieser Technik auf metabolische Konzentrationsdaten wurden bereits Zustände erkannt, die nicht mit herkömmlichen Techniken entdeckt werden konnten.

Ein zweiter Ansatz befasst sich mit der Auswertung der Struktur von Entscheidungswäldern zur Entdeckung von kombinatorischen Abhängigkeiten zwischen Attributen. Bisherige Arbeiten hierzu befassten sich vornehmlich mit rechenintensiven Verfahren oder mit einzelnen Entscheidungsbäumen, eine sehr eingeschränkte Ausbeutung der Daten. Das führte dann entweder zu unvollständigen oder instabilen Ergebnissen. Darum wird hier eine Methode entwickelt, die Mengen von Entscheidungsbäumen nutzt, um diese Beschränkungen zu überwinden.

Beide vorgestellten Verfahren gibt es als Werkzeuge für den Computer, die entweder hintereinander oder einzeln verwendet werden können. Auf diese Weise stellen sie eine sinnvolle Ergänzung zu vorhandenen Analyswerkzeugen dar.

Mit Hilfe der bereitgestellten Software war es möglich, bekanntes Wissen zu bestätigen und interessante neue Zusammenhänge im Stoffwechsel von Pflanzen aufzuzeigen.

APA, Harvard, Vancouver, ISO, and other styles

38

Kogelnik, Andreas Matthias. "Biological information management with application to human genome data." Diss., Georgia Institute of Technology, 1998. http://hdl.handle.net/1853/15923.

Full text

APA, Harvard, Vancouver, ISO, and other styles

39

Yu, Yun William. "Compressive algorithms for search and storage in biological data." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/112879.

Full text

Abstract:

Thesis: Ph. D., Massachusetts Institute of Technology, Department of Mathematics, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 187-197).
Disparate biological datasets often exhibit similar well-defined structure; efficient algorithms can be designed to exploit this structure. In this doctoral thesis, we present a framework for similarity search based on entropy and fractal dimension; here, we prove that a clustered search algorithm scales in time with metric entropy (number of covering hyperspheres) if the fractal dimension is low. Using these ideas, entropy-scaling versions of standard bioinformatics search tools can be designed, including for small-molecule, metagenomics, and protein structure search. This 'compressive acceleration' approach, taking advantage of redundancy and sparsity in biological data, can also be leveraged for next-generation sequencing (NGS) read mapping. By pairing together a clustered grouping over similar reads and a homology table for similarities in the human genome, our CORA framework can accelerate all-mapping by several orders of magnitude. Additionally, we also present work on filtering empirical base-calling quality scores from Next Generation Sequencing data. By using the sparsity of k-mers of sufficient length in the human genome and imposing a human prior through the use of frequent k-mers in a large corpus of human DNA reads, we are able to quickly discard over 90% of the information found in those quality scores while retaining or even improving downstream variant-calling accuracy. This filtering step allows for fast lossy compression of quality scores.
by Yun William Yu.
Ph. D.

APA, Harvard, Vancouver, ISO, and other styles

40

Chen, Li. "Integrative Modeling and Analysis of High-throughput Biological Data." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/30192.

Full text

Abstract:

Computational biology is an interdisciplinary field that focuses on developing mathematical models and algorithms to interpret biological data so as to understand biological problems. With current high-throughput technology development, different types of biological data can be measured on a large scale, which calls for more sophisticated computational methods to analyze and interpret the data. In this dissertation research work, we propose novel methods to integrate, model and analyze multiple types of biological data, including microarray gene expression data, protein-DNA interaction data and protein-protein interaction data. These methods will help improve our understanding of biological systems. First, we propose a knowledge-guided multi-scale independent component analysis (ICA) method for biomarker identification on time course microarray data. Guided by a knowledge gene pool related to a specific disease under study, the method can determine disease relevant biological components from ICA modes and then identify biologically meaningful markers related to the specific disease. We have applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification. Second, we propose a novel method for transcriptional regulatory network identification by integrating gene expression data and protein-DNA binding data. The approach is built upon a multi-level analysis strategy designed for suppressing false positive predictions. With this strategy, a regulatory module becomes increasingly significant as more relevant gene sets are formed at finer levels. 
At each level, a two-stage support vector regression (SVR) method is utilized to reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure is followed to assess the significance of each regulatory module. The resulting performance on simulation data and yeast cell cycle data shows that the multi-level SVR approach outperforms other existing methods in the identification of both regulators and their target genes. We have further applied the proposed method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer. Third, we propose a bootstrapping Markov Random Field (MRF)-based method for subnetwork identification on microarray data by incorporating protein-protein interaction data. Methodologically, an MRF-based network score is first derived by considering the dependency among genes to increase the chance of selecting hub genes. A modified simulated annealing search algorithm is then utilized to find the optimal/suboptimal subnetworks with maximal network score. A bootstrapping scheme is finally implemented to generate confident subnetworks. Experimentally, we have compared the proposed method with other existing methods, and the resulting performance on simulation data shows that the bootstrapping MRF-based method outperforms other methods in identifying ground truth subnetwork and hub genes. We have then applied our method to breast cancer data to identify significant subnetworks associated with drug resistance. The identified subnetworks not only show good reproducibility across different data sets, but indicate several pathways and biological functions potentially associated with the development of breast cancer and drug resistance. 
In addition, we propose to develop network-constrained support vector machines (SVM) for cancer classification and prediction, by taking into account the network structure to construct classification hyperplanes. The simulation study demonstrates the effectiveness of our proposed method. The study on the real microarray data sets shows that our network-constrained SVM, together with the bootstrapping MRF-based subnetwork identification approach, can achieve better classification performance compared with conventional biomarker selection approaches and SVMs. We believe that the research presented in this dissertation not only provides novel and effective methods to model and analyze different types of biological data, but also, through extensive experiments on several real microarray data sets, shows the potential to improve the understanding of biological mechanisms related to cancers by generating novel hypotheses for further study.
Ph. D.
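
The simulated annealing search for a high-scoring subnetwork mentioned in the abstract above can be illustrated with a small sketch. This is not the dissertation's algorithm: the toy graph, the per-gene scores, the size-normalized score function and the cooling schedule are all assumptions chosen for illustration, and the MRF-based score and the bootstrapping step are left out.

```python
import math
import random

# Hypothetical toy data: an undirected interaction graph and per-gene scores.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "E"), ("E", "F")]
GENE_SCORE = {"A": 0.2, "B": 2.1, "C": 1.8, "D": -0.5, "E": 1.9, "F": -1.0}

NEIGHBORS = {}
for u, v in EDGES:
    NEIGHBORS.setdefault(u, set()).add(v)
    NEIGHBORS.setdefault(v, set()).add(u)


def is_connected(nodes):
    """Return True if the subgraph induced by `nodes` is connected."""
    nodes = set(nodes)
    if not nodes:
        return False
    stack, seen = [next(iter(nodes))], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend((NEIGHBORS[n] & nodes) - seen)
    return seen == nodes


def network_score(nodes):
    """Assumed score: sum of gene scores divided by sqrt(subnetwork size)."""
    return sum(GENE_SCORE[n] for n in nodes) / math.sqrt(len(nodes))


def anneal(start, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search connected subnetworks by toggling one gene per step."""
    rng = random.Random(seed)
    current = set(start)
    best = set(current)
    temp = t0
    genes = sorted(GENE_SCORE)
    for _ in range(steps):
        candidate = current ^ {rng.choice(genes)}  # add or remove one gene
        if candidate and is_connected(candidate):
            delta = network_score(candidate) - network_score(current)
            # Always accept improvements; accept worse moves with
            # probability exp(delta / temp), which shrinks as temp cools.
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                current = candidate
                if network_score(current) > network_score(best):
                    best = set(current)
        temp *= cooling
    return best
```

Calling `anneal({"B"})` returns the best connected gene set visited during the search; because `best` is only replaced by strictly better candidates, its score is never below that of the starting set.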


41

Ha, Sook Shin. "Dimensionality Reduction, Feature Selection and Visualization of Biological Data." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/77169.


Abstract:

Due to the high dimensionality of most biological data, it is a difficult task to directly analyze, model and visualize the data to gain biological insight. Thus, dimensionality reduction becomes an imperative pre-processing step in analyzing and visualizing high-dimensional biological data. Two major approaches to dimensionality reduction in genomic analysis and biomarker identification studies are: Feature extraction, creating new features by combining existing ones based on a mapping technique; and feature selection, choosing an optimal subset of all features based on an objective function. In this dissertation, we show how our innovative reduction schemes effectively reduce the dimensionality of DNA gene expression data to extract biologically interpretable and relevant features which result in enhancing the biomarker identification process. To construct biologically interpretable features and facilitate Muscular Dystrophy (MD) subtypes classification, we extract molecular features from MD microarray data by constructing sub-networks using a novel integrative scheme which utilizes protein-protein interaction (PPI) network, functional gene sets information and mRNA profiling data. The workflow includes three major steps: First, by combining PPI network structure and gene-gene co-expression relationship into a new distance metric, we apply affinity propagation clustering (APC) to build gene sub-networks; secondly, we further incorporate functional gene sets knowledge to complement the physical interaction information; finally, based on the constructed sub-network and gene set features, we apply multi-class support vector machine (MSVM) for MD sub-type classification and highlight the biomarkers contributing to the sub-type prediction. The experimental results show that our scheme could construct sub-networks that are more relevant to MD than those constructed by the conventional approach. 
Furthermore, our integrative strategy substantially improved the prediction accuracy, especially for those 'hard-to-classify' sub-types. Conventionally, pathway-based analysis assumes that genes in a pathway equally contribute to a biological function, thus assigning uniform weight to genes. However, this assumption has been proven incorrect and applying uniform weight in the pathway analysis may not be an adequate approach for tasks like molecular classification of diseases, as genes in a functional group may have different differential power. Hence, we propose to use different weights for the pathway analysis which resulted in the development of four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Weighting changes pathway scoring and brings up some new significant pathways, leading to the detection of disease-related genes that are missed under uniform weight. To help us understand our MD expression data better and derive scientific insight from it, we have explored a suite of visualization tools. Particularly, for selected top performing MD sub-networks, we displayed the network view using Cytoscape; functional annotations using IPA and DAVID functional analysis tools; expression pattern using heat-map and parallel coordinates plot; and MD associated pathways using KEGG pathway diagrams. We also performed weighted MD pathway analysis, and identified overlapping sub-networks across different weight schemes and different MD subtypes using Venn Diagrams, which resulted in the identification of a new sub-network significantly associated with MD. All those graphically displayed data and information helped us understand our MD data and the MD subtypes better, resulting in the identification of several potentially MD associated biomarker pathways and genes.
Ph. D.
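
The contrast between uniform and non-uniform gene weights in pathway scoring, described in the abstract above, can be sketched as follows. The abstract does not specify its four weighting schemes, so the weighted-sum score below, the gene names, the statistics and the weights are all hypothetical and serve only to show how weighting can change a pathway's score.

```python
import math


def pathway_score(gene_stats, weights=None):
    """Weighted combination of per-gene statistics for one pathway.

    With weights=None every gene gets weight 1 (the uniform assumption).
    The weighted sum is divided by the Euclidean norm of the weights so
    that scores stay comparable across pathways of different sizes.
    """
    if weights is None:
        weights = {g: 1.0 for g in gene_stats}
    num = sum(weights[g] * s for g, s in gene_stats.items())
    norm = math.sqrt(sum(weights[g] ** 2 for g in gene_stats))
    return num / norm


# Hypothetical per-gene differential-expression statistics (e.g. z-scores).
stats = {"geneA": 3.0, "geneB": 0.1, "geneC": -0.2, "geneD": 2.5}
# Hypothetical weights that emphasise the more discriminative genes.
w = {"geneA": 2.0, "geneB": 0.5, "geneC": 0.5, "geneD": 2.0}

uniform = pathway_score(stats)
weighted = pathway_score(stats, w)
```

In this toy case the weighted score is higher than the uniform one because the weights concentrate on the two genes with large statistics, which is the kind of effect that can bring additional pathways above a significance threshold.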


42

Su, Wei. "Motif Mining On Structured And Semi-structured Biological Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1365089538.


43

Guo, Junhai. "Statistical significance and biological relevance of microarray data clustering." Cincinnati, Ohio : University of Cincinnati, 2008. http://rave.ohiolink.edu/etdc/view.cgi?acc_num=ucin1204736862.


44

Ali, S. (Syed). "Employing VLC technology for transmitting data in biological tissue." Master's thesis, University of Oulu, 2019. http://jultika.oulu.fi/Record/nbnfioulu-201905141758.


Abstract:

With the development of wireless communication methods, visible light communication (VLC), a subset of optical wireless communication (OWC), has garnered much attention as a technology for secure short-range wireless communication. We present a feasibility study to determine the performance of VLC in short-range wireless transmission of data through biological tissue. VLC is a cost-efficient and secure means of transmitting high volumes of data wirelessly, and it can considerably reduce the interference issues caused by electromagnetic pulses and external electric fields. We present a simple measurement approach based on Monte Carlo simulation of photon propagation in tissue to estimate the strength of wireless communication with body implant devices. Using light for communication brings inherent security against unauthorized access to digital data acquired from the low-energy body implant devices used for medical diagnosis and other studies. This thesis discusses the typical components required to establish VLC, such as the transmitter, the receiver and the channel medium. Furthermore, two cases of Monte Carlo simulation of photon-tissue interaction are studied to determine whether VLC is a suitable substitute for radio frequency (RF) in wireless communication with body implants. The process of theoretical measurement begins with the conversion of light intensity into an electrical signal and an estimation of the achievable data rate through a complex heterogeneous biological tissue model. The theoretically achieved data rates were found to be on the order of megabits per second (Mbps), indicating that this technology could be used for reliable short-range wireless communication with a wider range of implantable medical devices and applications.
Biophotonics.fi presents a computational simulation of light propagation in different types of computational tissue models comprehensively validated by comparison with the team’s practical implementation of the same setup. This simulation is also used in this thesis (5.2.2) to approximate more accurate data rates of communication in case of a practical implementation.
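
The idea of Monte Carlo simulation of photon propagation can be illustrated with a deliberately minimal sketch. It is not the simulation used in the thesis: it models a single homogeneous slab, tracks only whether a photon crosses without any interaction, and uses an assumed attenuation coefficient and thickness. Under those assumptions the estimate should approach the Beer-Lambert value exp(-mu_t * thickness).

```python
import math
import random


def ballistic_fraction(mu_t, thickness, n_photons=100_000, seed=1):
    """Estimate the fraction of photons crossing a slab without interacting.

    mu_t      : total attenuation coefficient in 1/mm (assumed value)
    thickness : slab thickness in mm (assumed value)
    Free path lengths are drawn from an exponential distribution with
    mean 1/mu_t; a photon passes if its first free path exceeds the slab.
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_photons):
        free_path = rng.expovariate(mu_t)
        if free_path > thickness:
            passed += 1
    return passed / n_photons


estimate = ballistic_fraction(mu_t=1.0, thickness=2.0)
expected = math.exp(-1.0 * 2.0)  # Beer-Lambert law for the same slab
```

A realistic tissue model would also track scattering direction, absorption weights and layered geometry, and would convert the received optical power into an achievable data rate; none of that is attempted here.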


45

de, Vito Roberta. "Multi-study factor models for high-dimensional biological data." Doctoral thesis, Università degli studi di Padova, 2016. http://hdl.handle.net/11577/3424398.


Abstract:

High-throughput assays are transforming the study of biology, and are generating a rich, complex and diverse collection of high-dimensional data sets. Building systematic knowledge from this data is a cumulative process, which requires analyses that integrate multiple sources, studies, and technologies. The increased availability of ensembles of studies on related clinical populations, assaying technologies, and genomic features gives rise to two important categories of multi-study statistical components: 1) common factors shared across multiple studies; 2) study-specific factors. To capture these two different quantities, in this thesis we propose a novel class of factor analysis models, under both a frequentist and a Bayesian approach. In the frequentist approach an ECM algorithm is provided to obtain the maximum likelihood estimates. Moreover, we propose a Bayesian approach to apply the method to settings with more variables than subjects. In modeling dependencies among many variables, a sparse structure underlying the associations among genes is assumed. Both methods allow joint analysis of multiple high-throughput studies. The results are helpful for combining multiple studies, identifying reproducible biology across studies and interesting study-specific components, and removing idiosyncratic variation that lacks cross-study reproducibility.
Scientific analyses of large numbers of samples (high-throughput assays) are transforming biological studies. In particular, high-throughput assays generate a rich, complex and varied collection of high-dimensional data. Systematically extracting meaningful information from this kind of data requires a progressive process based on the simultaneous analysis of different sources, studies and technologies. The growing availability of numerous clinical studies on relevant groups and populations, and of various genetic studies, gives rise to two categories: the first comprises factors shared by all studies, and the second comprises factors specific to each study. To capture these two categories, we propose in this thesis a new class of factor analysis models, developed under both a frequentist and a Bayesian approach. In the frequentist approach, an ECM algorithm is proposed for maximum likelihood estimation of the parameters. In addition, a Bayesian approach is proposed to adapt the model to settings with more variables than subjects, p > n. In modelling the dependence among variables, a sparse structure is assumed to highlight the associations among genes. Both methods made it possible to model the different studies. Moreover, the results made it possible to identify a reproducible biological signal common to all studies, and to remove the part of the variance that obscures this signal.
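
The split into shared and study-specific factors described in this abstract can be written compactly. The notation below is an assumption made for illustration and does not appear in the abstract: $x_{is}$ is the observation for subject $i$ in study $s$, $\Phi$ holds loadings shared by all studies, $\Lambda_s$ holds loadings specific to study $s$, and the Gaussian assumptions are the usual ones for factor analysis.

```latex
x_{is} = \Phi f_{is} + \Lambda_s l_{is} + e_{is},
\qquad i = 1, \dots, n_s, \quad s = 1, \dots, S,

f_{is} \sim N(0, I), \quad l_{is} \sim N(0, I), \quad e_{is} \sim N(0, \Psi_s),

\Sigma_s = \operatorname{Cov}(x_{is}) = \Phi \Phi^{\top} + \Lambda_s \Lambda_s^{\top} + \Psi_s .
```

Under these assumptions the covariance of each study separates into a part common to all studies, $\Phi \Phi^{\top}$, a study-specific part, $\Lambda_s \Lambda_s^{\top}$, and residual noise $\Psi_s$.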


46

You, Chang Hun. "Learning patterns in dynamic graphs with application to biological networks." Pullman, Wash. : Washington State University, 2009. http://www.dissertations.wsu.edu/Dissertations/Summer2009/c_you_072309.pdf.


Abstract:

Thesis (Ph. D.)--Washington State University, August 2009.
Title from PDF title page (viewed on Aug. 19, 2009). "School of Electrical Engineering and Computer Science." Includes bibliographical references (p. 114-117).


47

Rodriguez, Palacios Miguel Andres. "Reversed Voodoo Dolls: An exploration of physical visualizations of biological data." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-175796.


Abstract:

Physical visualizations are artifacts that materialize abstract data. They take advantage of natural human abilities to interact with information in the physical world. These visualizations can potentially be applied to new application domains. To discover whether physical visualizations can support remote monitoring of biological data, a technology probe is presented in the form of a reversed voodoo doll. This probe uses the natural affordance of an anthropomorphic figure to represent a person and reverses the concept of voodoo dolls in a playful way. Safety is selected as the scenario for testing physical visualizations of bio-data. Two measurements from the human body, heart rate and motion, are chosen as a lightweight way to remotely monitor a person’s condition. During the study, a group of six participants were exposed to the technology probe and their interactions with it were observed. The study reports on the users’ interpretations of the data and the uses they made of the probe’s alternative modalities. The results suggest that the data mapping to the object’s body parts was effective for conveying meaning. Additionally, the results confirm that the use of multiple modalities in physical visualizations offers an opportunity to present information in situated contexts in the real world. The degree of physicality achieved by the reversed voodoo doll and the effects of the selected metaphors are discussed. In conclusion, it is argued that the responses and interpretations from the users indicate that the reversed voodoo doll served as a means in its own right to transmit information for monitoring of bio-data.
Physical visualizations are artifacts that materialize abstract data. By drawing on natural human abilities, they interact with information in the physical world. These visualizations create opportunities for application in new domains. To investigate whether physical visualizations can support remote monitoring of biological data, a probe in the form of a reversed voodoo doll was introduced. With a human-like figure, this probe represents a real person. In this way it exploits natural associations with human characteristics and reverses the concept of voodoo dolls in a playful way. The physical visualizations of biological data are tested from a safety perspective. Two values, heart rate and motion, are measured from a human body to make it possible to monitor a person's condition remotely. During the study, six users are observed as they interact with the probe. The study shows how the users interpret the probe's data and how usage varies with respect to the probe's different modalities. The results of this study suggest that the data mapping to the probe's body parts effectively increased understanding. In addition, the results confirm that the use of multiple modalities in physical visualizations makes it possible to present information adapted to different situations in the real world. The degree to which the voodoo doll gives a sense of physicality, as well as the consequences of the chosen metaphors, are discussed. In conclusion, it is argued that the users' responses and interpretations indicate that the reversed voodoo doll functioned as a means of monitoring biological data.


48

BONOMO, Mariella. "Knowledge Extraction from Biological and Social Graphs." Doctoral thesis, Università degli Studi di Palermo, 2022. https://hdl.handle.net/10447/576508.


Abstract:

Many real-life problems involve the generation of enormous, varied, dynamic, and interconnected datasets coming from different and heterogeneous sources. Analysing large volumes of data makes it possible to generate new knowledge useful for making more informed decisions, in business and beyond. From personalising customer communication to streamlining production processes, via flow and emergency management, Big Data Analytics has an impact on all processes. The potential uses of Big Data go much further: two of the largest sources of data are individual traders’ purchasing histories and Biological Networks, the latter used for disease prediction or studied through network reduction. From a computer science point of view, the networks are graphs with various characteristics specific to the application domain. This PhD thesis focuses on the proposal of novel knowledge extraction techniques for large graphs, mainly based on Big Data methodologies. Two application contexts are considered and three specific problems have been solved: social data, for the optimization of advertising campaigns, the comparison of user profiles, and neighborhood analysis; and biological and medical data, with the final aim of identifying biomarkers for diagnosis, treatment, prognosis, and prevention of diseases.


49

Garratt, Jane Annabel. "Morphological data from coccolith images using Fourier power spectra." Thesis, Kingston University, 1992. http://eprints.kingston.ac.uk/20749/.


Abstract:

A new technique for retrieving morphological data from coccolith images using semi-automatic methods is described. The data are acquired as digital video microscope images, and are analyzed using a Fast Fourier Transform which produces Fourier power spectra reflecting the coccolith morphology. Representative data from these power spectra are used as input to principal component and discriminant function analyses. Scores produced by the discriminant function analysis are plotted to show the intra-generic morphological variation that is identified by the system, and suites of variables that contribute significantly to this separation are determined. The results show that objective, useful morphological information can be retrieved using these techniques. Specimens can be distinguished at both genus and species level, and important variables that contribute to the taxonomic variation in morphology are identified for the first time. Morphological changes affecting all the taxa have been shown to occur during the deposition of the Gault Clay Formation. These changes are especially significant for the Watznaueria genus because this genus was thought to be in evolutionary stasis during the Albian stage. The changes within the genera Watznaueria and Zeugrhabdotus coincide with the junction between the Lower and Upper Gault formations, whilst the genus Prediscosphaera changes during the Upper Gault. The results suggest that evolution takes place in rapid bursts rather than in a gradual manner, and support the hypothesis that punctuated equilibrium is the controlling mode of evolutionary change.
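
How a Fourier power spectrum reflects repeating structure can be shown with a short sketch. It is only an illustration under simplifying assumptions: the thesis works on two-dimensional microscope images with a Fast Fourier Transform, whereas the code below applies a direct one-dimensional discrete Fourier transform to a hypothetical intensity profile.

```python
import cmath
import math


def power_spectrum(signal):
    """Return |DFT|^2 of a real 1-D signal using a direct O(n^2) transform."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# Hypothetical intensity profile: a pattern repeating 4 times over 32 samples.
n = 32
profile = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
power = power_spectrum(profile)

# Strongest non-zero frequency below the Nyquist limit.
peak = max(range(1, n // 2), key=lambda k: power[k])
```

The spectrum concentrates its power at the frequency matching the number of repeats in the profile, which is the kind of summary value that could then be passed on to principal component or discriminant function analysis.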


50

Zandegiacomo, Cella Alice. "Multiplex network analysis with application to biological high-throughput data." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10495/.


Abstract:

This thesis studies some characteristics of multiplex networks; in particular, the analysis focuses on quantifying the differences between the layers of the multiplex. Dissimilarities are evaluated both by observing the connections of individual nodes in different layers and by estimating the different partitions of the layers. Several important measures for characterizing multiplexes are therefore introduced and then used to build community detection methods. The difference between the partitions of two layers is quantified using a mutual information measure. The use of the hypergeometric test for identifying over-represented nodes in a layer is also examined in depth, showing the effectiveness of the test as a function of the similarity of the layers. These methods for characterizing the properties of multiplex networks are applied to real biological data. The data were collected by the DILGOM study with the aim of determining the genetic, transcriptomic and metabolic implications of obesity and metabolic syndrome. These data are used by the Mimomics project to determine relationships between different omics. In the thesis, the metabolic data are analyzed using a multiplex network approach to check for differences between the relationships among blood compounds of obese and normal-weight people.
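
Comparing the community partitions of two layers with a mutual information measure can be sketched as below. The abstract does not say which variant of mutual information the thesis uses, so this is the plain (unnormalized) definition in bits, and the node names and partitions are hypothetical.

```python
import math
from collections import Counter


def mutual_information(part_a, part_b):
    """Mutual information (in bits) between two partitions of the same nodes.

    part_a, part_b: dicts mapping each node to its community label.
    """
    nodes = list(part_a)
    n = len(nodes)
    count_a = Counter(part_a[v] for v in nodes)
    count_b = Counter(part_b[v] for v in nodes)
    joint = Counter((part_a[v], part_b[v]) for v in nodes)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) )
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical partitions of four nodes in different layers.
layer1 = {"n1": 0, "n2": 0, "n3": 1, "n4": 1}
layer2_same = {"n1": "x", "n2": "x", "n3": "y", "n4": "y"}
layer2_indep = {"n1": "x", "n2": "y", "n3": "x", "n4": "y"}

mi_same = mutual_information(layer1, layer2_same)
mi_indep = mutual_information(layer1, layer2_indep)
```

Two layers with the same community structure share one full bit of information in this example, while a layer whose communities cut across those of the first shares none.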
