Dissertations / Theses on the topic 'Integration of peptidomic data'

To see the other types of publications on this topic, follow the link: Integration of peptidomic data.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Integration of peptidomic data.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Suwareh, Ousmane. "Modélisation de la pepsinolyse in vitro en conditions gastriques et inférence de réseaux de filiation de peptides à partir de données de peptidomique." Electronic Thesis or Diss., Rennes, Agrocampus Ouest, 2022. https://tel.archives-ouvertes.fr/tel-04059711.

Full text
Abstract:
Addressing the current demographic challenges, “civilization diseases” and the possible depletion of food resources requires optimizing the effective use of foods and adapting their design to the specific needs of each target population. This in turn requires a better understanding of the different stages of the digestion process. In particular, how proteins are hydrolyzed is a major issue, due to their crucial role in human nutrition. However, the probabilistic laws governing the action of pepsin, the first protease to act in the gastrointestinal tract, are still unclear. In a first approach based on peptidomic data, we demonstrate that the hydrolysis by pepsin of a peptide bond depends on the nature of the amino acid residues in its large neighborhood, but also on physicochemical and structural variables describing its environment. In a second step, considering the physicochemical environment at the peptide level, we propose a nonparametric model of the hydrolysis of these peptides by pepsin, together with an Expectation-Maximization type estimation algorithm, offering novel perspectives for the valorization of peptidomic data. In this dynamic approach, we integrate the peptide kinship network into the estimation procedure, which leads to a more parsimonious model that is also more relevant with regard to biological interpretation.
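As a rough illustration of the kind of peptidomics-driven estimation this abstract describes (not the thesis's actual nonparametric or EM-based model), the following Python sketch tabulates empirical cleavage frequencies of pepsin conditioned on the residues flanking each peptide bond; the sequence, cleavage positions and function names are hypothetical.

```python
# A minimal sketch (not the thesis's model): estimate the empirical probability
# that pepsin cleaves a peptide bond, conditioned on the residues flanking the
# bond, from a set of cleavage sites inferred from peptidomic data.
from collections import Counter

def cleavage_probabilities(protein_seq, cleaved_positions, window=1):
    """Return P(cleavage | flanking residues) for each observed context.

    protein_seq       -- parent protein sequence, e.g. "MKWVTFISLL..."
    cleaved_positions -- set of bond indices i (bond between residues i and i+1)
    window            -- number of residues kept on each side of the bond
    """
    cleaved, total = Counter(), Counter()
    for i in range(window - 1, len(protein_seq) - window):
        context = protein_seq[i - window + 1 : i + window + 1]
        total[context] += 1
        if i in cleaved_positions:
            cleaved[context] += 1
    return {ctx: cleaved[ctx] / n for ctx, n in total.items() if n > 0}

# Toy usage with a made-up sequence and cleavage sites:
probs = cleavage_probabilities("MKWVTFISLLLLFSSAYSRGVFRR", {3, 7, 15}, window=1)
print(sorted(probs.items(), key=lambda kv: -kv[1])[:5])
```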
APA, Harvard, Vancouver, ISO, and other styles
2

Nadal, Francesch Sergi. "Metadata-driven data integration." Doctoral thesis, Universitat Politècnica de Catalunya, 2019. http://hdl.handle.net/10803/666947.

Full text
Abstract:
Data has an undoubtable impact on society. Storing and processing large amounts of available data is currently one of the key success factors for an organization. Nonetheless, we are recently witnessing a change represented by huge and heterogeneous amounts of data. Indeed, 90% of the data in the world has been generated in the last two years. Thus, in order to carry out these data exploitation tasks, organizations must first perform data integration, combining data from multiple sources to yield a unified view over them. Yet, the integration of massive and heterogeneous amounts of data requires revisiting the traditional integration assumptions to cope with the new requirements posed by such data-intensive settings. This PhD thesis aims to provide a novel framework for data integration in the context of data-intensive ecosystems, which entails dealing with vast amounts of heterogeneous data, from multiple sources and in their original format. To this end, we advocate an integration process consisting of sequential activities governed by a semantic layer, implemented via a shared repository of metadata. From a stewardship perspective, these activities are the deployment of a data integration architecture, followed by the population of such shared metadata. From a data consumption perspective, the activities are virtual and materialized data integration, the former an exploratory task and the latter a consolidation one. Following the proposed framework, we focus on providing contributions to each of the four activities. We begin by proposing a software reference architecture for semantic-aware data-intensive systems. Such an architecture serves as a blueprint to deploy a stack of systems, its core being the metadata repository. Next, we propose a graph-based metadata model as a formalism for metadata management. We focus on supporting schema and data source evolution, a predominant factor in the heterogeneous sources at hand. For virtual integration, we propose query rewriting algorithms that rely on the previously proposed metadata model. We additionally consider semantic heterogeneities in the data sources, which the proposed algorithms are capable of automatically resolving. Finally, the thesis focuses on the materialized integration activity, and to this end, proposes a method to select intermediate results to materialize in data-intensive flows. Overall, the results of this thesis serve as a contribution to the field of data integration in contemporary data-intensive ecosystems.
APA, Harvard, Vancouver, ISO, and other styles
3

Jakonienė, Vaida. "Integration of biological data /." Linköping : Linköpings universitet, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-7484.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Akeel, Fatmah Y. "Secure data integration systems." Thesis, University of Southampton, 2017. https://eprints.soton.ac.uk/415716/.

Full text
Abstract:
As the web moves increasingly towards publishing data, a significant challenge arises when integrating data from diverse sources that have heterogeneous security and privacy policies and requirements. Data Integration Systems (DIS) are concerned with integrating data from multiple data sources to resolve users' queries. DIS are prone to data leakage threats, e.g. unauthorised disclosure or secondary use of the data, that compromise the data's confidentiality and privacy. We claim that these threats are caused by the failure to implement or correctly employ confidentiality and privacy techniques, and by the failure to consider the trust levels of system entities, from the very start of system development. Data leakage also results from a failure to capture or implement the security policies imposed by the data providers on the collection, processing, and disclosure of personal and sensitive data. This research proposes a novel framework, called SecureDIS, to mitigate data leakage threats in DIS. Unlike existing approaches that secure such systems, SecureDIS helps software engineers to lessen data leakage threats during the early phases of DIS development. It comprises six components that represent a conceptualised DIS architecture: data and data sources, security policies, integration approach, integration location, data consumers, and System Security Management (SSM). Each component contains a set of informal guidelines written in natural language to be used by software engineers who build and design a DIS that handles sensitive and personal data. SecureDIS has undergone two rounds of review by experts to confirm its validity, resulting in the guidelines being evaluated and extended. Two approaches were adopted to ensure that SecureDIS is suitable for software engineers. The first was to formalise the guidelines by modelling a DIS with the SecureDIS security policies using Event-B formal methods. This verified the correctness and consistency of the model. The second approach assessed SecureDIS's applicability to a real data integration project by using a case study. The case study addressed the experts' concerns regarding the ability to apply the proposed guidelines in practice.
APA, Harvard, Vancouver, ISO, and other styles
5

Eberius, Julian. "Query-Time Data Integration." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-191560.

Full text
Abstract:
Today, data is collected in ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources precludes up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions are the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
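To make the top-k entity augmentation idea more concrete, here is a greatly simplified Python sketch (not the thesis's algorithm): it greedily assembles k alternative augmentations that cover the queried entities with few sources, and penalises sources already used in earlier alternatives to keep the results mutually diverse. The source/value layout and all names are assumptions made for illustration.

```python
# Simplified top-k augmentation sketch: each candidate web source maps some
# entities to a value for the requested attribute; build k alternative
# augmentations, each covering the query entities with few sources.
def topk_augmentations(entities, sources, k=3, diversity_penalty=0.5):
    """sources: dict source_name -> dict entity -> value (hypothetical format)."""
    results, used_before = [], set()
    for _ in range(k):
        uncovered, chosen, values = set(entities), [], {}
        while uncovered:
            def score(item):
                name, mapping = item
                gain = len(uncovered & mapping.keys())
                return gain - (diversity_penalty * gain if name in used_before else 0)
            name, mapping = max(sources.items(), key=score)
            newly = uncovered & mapping.keys()
            if not newly:                 # no source can cover the rest
                break
            chosen.append(name)
            values.update({e: mapping[e] for e in newly})
            uncovered -= newly
        results.append({"sources": chosen, "values": values})
        used_before.update(chosen)
    return results

# Toy usage: augment three cities with a "population" attribute.
srcs = {"web_table_1": {"Dresden": 556000, "Leipzig": 597000},
        "web_table_2": {"Chemnitz": 246000, "Dresden": 555000},
        "web_table_3": {"Dresden": 554000, "Leipzig": 596000, "Chemnitz": 245000}}
print(topk_augmentations(["Dresden", "Leipzig", "Chemnitz"], srcs, k=2))
```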
APA, Harvard, Vancouver, ISO, and other styles
6

Jakonienė, Vaida. "Integration of Biological Data." Doctoral thesis, Linköpings universitet, IISLAB - Laboratoriet för intelligenta informationssystem, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-7484.

Full text
Abstract:
Data integration is an important procedure underlying many research tasks in the life sciences, as often multiple data sources have to be accessed to collect the relevant data. The data sources vary in content, data format, and access methods, which often vastly complicates the data retrieval process. As a result, the task of retrieving data requires a great deal of effort and expertise on the part of the user. To alleviate these difficulties, various information integration systems have been proposed in the area. However, a number of issues remain unsolved and new integration solutions are needed. The work presented in this thesis considers data integration at three different levels. 1) Integration of biological data sources deals with integrating multiple data sources from an information integration system point of view. We study properties of biological data sources and existing integration systems. Based on the study, we formulate requirements for systems integrating biological data sources. Then, we define a query language that supports queries commonly used by biologists. Also, we propose a high-level architecture for an information integration system that meets a selected set of requirements and that supports the specified query language. 2) Integration of ontologies deals with finding overlapping information between ontologies. We develop and evaluate algorithms that use life science literature and take the structure of the ontologies into account. 3) Grouping of biological data entries deals with organizing data entries into groups based on the computation of similarity values between the data entries. We propose a method that covers the main steps and components involved in similarity-based grouping procedures. The applicability of the method is illustrated by a number of test cases. Further, we develop an environment that supports comparison and evaluation of different grouping strategies. The work is supported by the implementation of: 1) a prototype for a system integrating biological data sources, called BioTRIFU, 2) algorithms for ontology alignment, and 3) an environment for evaluating strategies for similarity-based grouping of biological data, called KitEGA.
APA, Harvard, Vancouver, ISO, and other styles
7

Peralta, Veronika. "Data Quality Evaluation in Data Integration Systems." Phd thesis, Université de Versailles-Saint Quentin en Yvelines, 2006. http://tel.archives-ouvertes.fr/tel-00325139.

Full text
Abstract:
The need for uniform access to multiple data sources grows stronger every day, particularly in decision-support systems that require a comprehensive analysis of the data. With the development of Data Integration Systems (DIS), information quality has become a first-class property increasingly demanded by users. This thesis deals with data quality in DIS. More precisely, we address the problems of evaluating the quality of the data delivered to users in response to their queries and of satisfying users' quality requirements. We also analyse how quality measures can be used to improve the design of the DIS and the quality of its data. Our approach consists in studying one quality factor at a time, analysing its relationship with the DIS, proposing techniques for its evaluation and proposing actions for its improvement. Among the quality factors that have been proposed, this thesis analyses two: data freshness and data accuracy. We analyse the various definitions and measures that have been proposed for data freshness and data accuracy, and we identify the DIS properties that have a significant impact on their evaluation. We summarise the analysis of each factor by means of a taxonomy, which serves to compare existing work and to highlight open problems. We propose a framework that models the various elements involved in quality evaluation, such as data sources, user queries, the DIS integration processes, DIS properties, quality measures and quality evaluation algorithms. In particular, we model DIS integration processes as workflow processes in which activities perform the tasks that extract, integrate and deliver data to users. Our reasoning support for quality evaluation is a directed acyclic graph, called the quality graph, which has the same structure as the DIS and carries, as labels, the DIS properties relevant to quality evaluation. We develop evaluation algorithms that take as input the quality values of the source data and the DIS properties, and combine these values to qualify the data delivered by the DIS. They rely on the graph representation and combine property values while traversing the graph. The evaluation algorithms can be specialised to take into account the properties that affect quality in a concrete application. The idea behind the framework is to define a flexible context that allows the evaluation algorithms to be specialised to specific application scenarios. The quality values obtained during evaluation are compared with those expected by users; improvement actions can be carried out if the quality requirements are not met. We suggest elementary improvement actions that can be composed to improve quality in a concrete DIS. Our approach to improving data freshness consists in analysing the DIS at different levels of abstraction in order to identify its critical points and to target improvement actions at those points. Our approach to improving data accuracy consists in partitioning query results into portions (certain attributes, certain tuples) of homogeneous accuracy. This allows user applications to display only the most accurate data, to filter out data that does not meet accuracy requirements, or to display data in tiers according to its accuracy. Compared with existing source-selection approaches, our proposal selects the most accurate portions instead of filtering out entire sources. The main contributions of this thesis are: (1) a detailed analysis of the freshness and accuracy quality factors; (2) techniques and algorithms for evaluating and improving data freshness and data accuracy; and (3) a quality evaluation prototype usable in DIS design.
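The graph-traversal evaluation described above can be pictured with a small Python sketch. It is only a toy under simplifying assumptions (freshness as an age in minutes, a fixed per-activity delay, worst-case combination of inputs), not the thesis's algorithms, and all node names are made up.

```python
# A minimal sketch of propagating a freshness value through a quality graph:
# source nodes carry a given data age, activity nodes add a processing delay,
# and each node's freshness is the oldest input freshness plus its own delay.
from graphlib import TopologicalSorter   # Python 3.9+

def evaluate_freshness(edges, source_freshness, node_delay):
    """edges: dict node -> set of predecessor nodes (the quality graph)."""
    freshness = dict(source_freshness)
    for node in TopologicalSorter(edges).static_order():
        if node in freshness:                      # data source: given value
            continue
        inputs = [freshness[p] for p in edges.get(node, ())]
        freshness[node] = max(inputs, default=0) + node_delay.get(node, 0)
    return freshness

# Toy quality graph: two sources feeding an integration activity and a query node.
graph = {"integrate": {"src_sales", "src_crm"}, "user_query": {"integrate"}}
print(evaluate_freshness(graph,
                         source_freshness={"src_sales": 10, "src_crm": 120},
                         node_delay={"integrate": 30, "user_query": 1}))
```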
APA, Harvard, Vancouver, ISO, and other styles
8

Peralta, Costabel Veronika del Carmen. "Data quality evaluation in data integration systems." Versailles-St Quentin en Yvelines, 2006. http://www.theses.fr/2006VERS0020.

Full text
Abstract:
This thesis deals with data quality evaluation in Data Integration Systems (DIS). Specifically, we address the problems of evaluating the quality of the data conveyed to users in response to their queries and verifying if users’ quality expectations can be achieved. We also analyze how quality measures can be used for improving the DIS and enforcing data quality. Our approach consists in studying one quality factor at a time, analyzing its impact within a DIS, proposing techniques for its evaluation and proposing improvement actions for its enforcement. Among the quality factors that have been proposed, this thesis analyzes two of the most used ones: data freshness and data accuracy
APA, Harvard, Vancouver, ISO, and other styles
9

Neumaier, Sebastian, Axel Polleres, Simon Steyskal, and Jürgen Umbrich. "Data Integration for Open Data on the Web." Springer International Publishing AG, 2017. http://dx.doi.org/10.1007/978-3-319-61033-7_1.

Full text
Abstract:
In this lecture we will discuss and introduce challenges of integrating openly available Web data and how to solve them. Firstly, while we will address this topic from the viewpoint of Semantic Web research, not all data is readily available as RDF or Linked Data, so we will give an introduction to different data formats prevalent on the Web, namely, standard formats for publishing and exchanging tabular, tree-shaped, and graph data. Secondly, not all Open Data is really completely open, so we will discuss and address issues around licences and terms of usage associated with Open Data, as well as documentation of data provenance. Thirdly, we will discuss issues connected with (meta-)data quality associated with Open Data on the Web and how Semantic Web techniques and vocabularies can be used to describe and remedy them. Fourthly, we will address issues about searchability and integration of Open Data and discuss to what extent semantic search can help to overcome these. We close by briefly summarizing further issues not covered explicitly herein, such as multi-linguality, temporal aspects (archiving, evolution, temporal querying), as well as how and whether OWL and RDFS reasoning on top of integrated open data could help.
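As a minimal illustration of the three families of Web data formats mentioned here (tabular, tree-shaped and graph data), the following standard-library Python snippet reads a CSV table, a JSON document and a single N-Triples statement; the data values are invented.

```python
# Tabular (CSV), tree-shaped (JSON) and graph (N-Triples) data, read with the
# Python standard library only; the contents are toy examples.
import csv, io, json

tabular = io.StringIO("city,population\nVienna,1900000\nGraz,290000\n")
rows = list(csv.DictReader(tabular))                      # tabular data

tree = json.loads('{"dataset": {"title": "Air quality", "keywords": ["NO2", "PM10"]}}')

# Graph data in N-Triples: one "subject predicate object ." statement per line.
ntriples = '<http://example.org/Vienna> <http://example.org/population> "1900000" .'
subject, predicate, obj = ntriples.rstrip(" .").split(" ", 2)

print(rows[0]["population"], tree["dataset"]["title"], obj)
```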
APA, Harvard, Vancouver, ISO, and other styles
10

Cheng, Hui. "Data integration and visualization for systems biology data." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/77250.

Full text
Abstract:
Systems biology aims to understand cellular behavior in terms of the spatiotemporal interactions among cellular components, such as genes, proteins and metabolites. Comprehensive visualization tools for exploring multivariate data are needed to gain insight into the physiological processes reflected in these molecular profiles. Data fusion methods are required to study high-throughput transcriptomics, metabolomics and proteomics data in an integrated manner before systems biology can live up to its potential. In this work I explored mathematical and statistical methods and visualization tools to resolve the prominent issues in the nature of systems biology data fusion and to gain insight into these comprehensive data. In order to choose and apply multivariate methods, it is important to know the distribution of the experimental data. Chi-square Q-Q plots and violin plots were applied to all M. truncatula and V. vinifera data, and most distributions were found to be right-skewed (Chapter 2). The biplot display provides an effective tool for reducing the dimensionality of the systems biology data and displaying the molecules and time points jointly on the same plot. The biplot of the M. truncatula data revealed the overall system behavior, including unidentified compounds of interest and the dynamics of the highly responsive molecules (Chapter 3). The phase spectrum computed from the Fast Fourier transform of the time course data has been found to play a more important role than the amplitude in signal reconstruction. Phase spectrum analyses on in silico data created with two artificial biochemical networks, the Claytor model and the AB2 model, proved that the phase spectrum is indeed an effective tool in systems biology data fusion despite the data heterogeneity (Chapter 4). The difference between data integration and data fusion is further discussed. Biplot analysis of scaled data was applied to integrate transcriptome, metabolome and proteome data from the V. vinifera project. The phase spectrum combined with k-means clustering was used in integrative analyses of the transcriptome and metabolome of the M. truncatula yeast elicitation data and of the transcriptome, metabolome and proteome of the V. vinifera salinity stress data. The phase spectrum analysis was compared with the biplot display as an effective tool in data fusion (Chapter 5). The results suggest that the phase spectrum may perform better than the biplot. This work was funded by the National Science Foundation Plant Genome Program, grant DBI-0109732, and by the Virginia Bioinformatics Institute.
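A minimal Python sketch of the phase-spectrum idea (not the dissertation's exact pipeline) might look as follows: compute the FFT phase of each molecule's time-course profile and group molecules with similar phase patterns using k-means; the array sizes and parameters are hypothetical.

```python
# Phase-spectrum clustering sketch: FFT phase per time-course profile, then k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
time_courses = rng.normal(size=(50, 12))        # 50 molecules x 12 time points (toy data)

spectra = np.fft.rfft(time_courses, axis=1)     # Fast Fourier transform per profile
phase = np.angle(spectra)[:, 1:4]               # phase of the first few non-DC harmonics

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(phase)
print(np.bincount(labels))                      # cluster sizes
```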
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
11

Hackl, Peter, and Michaela Denk. "Data Integration: Techniques and Evaluation." Austrian Statistical Society, 2004. http://epub.wu.ac.at/5631/1/435%2D1317%2D1%2DSM.pdf.

Full text
Abstract:
Within the DIECOFIS framework, ec3, the Division of Business Statistics from the Vienna University of Economics and Business Administration and ISTAT worked together to find methods to create a comprehensive database of enterprise data required for taxation microsimulations via integration of existing disparate enterprise data sources. This paper provides an overview of the broad spectrum of investigated methodology (including exact and statistical matching as well as imputation) and related statistical quality indicators, and emphasises the relevance of data integration, especially for official statistics, as a means of using available information more efficiently and improving the quality of a statistical agency's products. Finally, an outlook on an empirical study comparing different exact matching procedures in the maintenance of Statistics Austria's Business Register is presented.
APA, Harvard, Vancouver, ISO, and other styles
12

Kerr, W. Scott. "Data integration using virtual repositories." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape7/PQDD_0005/MQ45950.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Bauckmann, Jana. "Dependency discovery for data integration." Phd thesis, Universität Potsdam, 2013. http://opus.kobv.de/ubp/volltexte/2013/6664/.

Full text
Abstract:
Data integration aims to combine data of different sources and to provide users with a unified view on these data. This task is as challenging as it is valuable. In this thesis we propose algorithms for dependency discovery to provide necessary information for data integration. We focus on inclusion dependencies (INDs) in general and a special form named conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema. (ii) INDs and CINDs support the discovery of cross-references or links between schemas. An IND “A in B” simply states that all values of attribute A are included in the set of values of attribute B. We propose an algorithm that discovers all inclusion dependencies in a relational data source. The challenge of this task is the complexity of testing all attribute pairs and, further, of comparing all values of each attribute pair. The complexity of existing approaches depends on the number of attribute pairs, while ours depends only on the number of attributes. Thus, our algorithm makes it possible to profile entirely unknown data sources with large schemas by discovering all INDs. Further, we provide an approach to extract foreign keys from the identified INDs. We extend our IND discovery algorithm to also find three special types of INDs: (i) composite INDs, such as “AB in CD”, (ii) approximate INDs that allow a certain amount of values of A to be not included in B, and (iii) prefix and suffix INDs that represent special cross-references between schemas. Conditional inclusion dependencies are inclusion dependencies with a limited scope defined by conditions over several attributes. Only the matching part of the instance must adhere to the dependency. We generalize the definition of CINDs, distinguishing covering and completeness conditions, and define quality measures for conditions. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. The challenge for this task is twofold: (i) Which (and how many) attributes should be used for the conditions? (ii) Which attribute values should be chosen for the conditions? Previous approaches rely on pre-selected condition attributes or can only discover conditions applying to quality thresholds of 100%. Our approaches were motivated by two application domains: data integration in the life sciences and link discovery for linked open data. We show the efficiency and the benefits of our approaches for use cases in these domains.
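For readers unfamiliar with inclusion dependencies, the definition above ("A in B" holds when every value of A occurs among the values of B) can be illustrated with a toy Python check over all attribute pairs; note that this naive pairwise test is precisely the quadratic behaviour the thesis's algorithm is designed to avoid, and the table contents are invented.

```python
# Toy inclusion dependency discovery: "A in B" holds if every value of A also
# occurs in B. Checks all attribute pairs on small in-memory data.
def discover_inds(table):
    """table: dict attribute -> list of values (hypothetical input format)."""
    value_sets = {attr: set(vals) for attr, vals in table.items()}
    return [(a, b)
            for a in value_sets for b in value_sets
            if a != b and value_sets[a] <= value_sets[b]]

orders = {"customer_id": [1, 2, 2, 3],
          "all_customers": [1, 2, 3, 4, 5],
          "product_id": [7, 8]}
print(discover_inds(orders))   # [('customer_id', 'all_customers')]
```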
APA, Harvard, Vancouver, ISO, and other styles
14

Krisnadhi, Adila Alfa. "Ontology Pattern-Based Data Integration." Wright State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=wright1453177798.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Braunschweig, Katrin, Julian Eberius, Maik Thiele, and Wolfgang Lehner. "Frontiers in Crowdsourced Data Integration." De Gruyter, 2012. https://tud.qucosa.de/id/qucosa%3A72850.

Full text
Abstract:
There is an ever-increasing amount and variety of open web data available that is insufficiently examined or not considered at all in decision-making processes. This is because of the lack of end-user friendly tools that help to reuse this public data and to create knowledge out of it. Therefore, we propose a schema-optional data repository that provides the flexibility necessary to store and gradually integrate heterogeneous web data. Based on this repository, we propose a semi-automatic schema enrichment approach that efficiently augments the data in a “pay-as-you-go” fashion. Due to the ambiguities that inherently arise, we further propose a crowd-based verification component that is able to resolve such conflicts in a scalable manner.
APA, Harvard, Vancouver, ISO, and other styles
16

Nagyová, Barbora. "Data integration in large enterprises." Master's thesis, Vysoká škola ekonomická v Praze, 2015. http://www.nusl.cz/ntk/nusl-203918.

Full text
Abstract:
Data Integration is currently an important and complex topic for many companies, because having a good, working Data Integration solution can bring multiple advantages over competitors. Data Integration is usually executed in the form of a project, which can easily turn into a failure. In order to decrease the risks and negative impact of a failed Data Integration project, there needs to be good project management, Data Integration knowledge and the right technology in place. This thesis provides a framework for setting up a good Data Integration solution. The framework is developed based on current theory, currently available Data Integration tools and opinions provided by experts who have worked in the field for a minimum of 7 years and have proven their skills with a successful Data Integration project. This thesis does not guarantee the development of the right Data Integration solution, but it does provide guidance on how to deal with a Data Integration project in a large enterprise. The thesis is structured into seven chapters. The first chapter gives an overview of the thesis, including its scope, goals, assumptions and expected value. The second chapter describes Data Management and basic Data Integration theory in order to distinguish these two topics and to explain the relationship between them. The third chapter focuses purely on Data Integration theory, which should be known by everyone who participates in a Data Integration project. The fourth chapter analyses the features of current Data Integration solutions available on the market and provides an overview of the most common and necessary functionalities. Chapter five covers the practical part of the thesis, where the Data Integration framework is designed based on the findings from previous chapters and interviews with experts in the field. Chapter six then applies the framework to a real, working (anonymized) Data Integration solution, highlights the gaps between the framework and the solution and provides guidance on how to deal with these gaps. Chapter seven provides a summary, personal opinion and outlook.
APA, Harvard, Vancouver, ISO, and other styles
17

Tallur, Gayatri. "Uncertain data integration with probabilities." Thesis, The University of North Carolina at Greensboro, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1551297.

Full text
Abstract:

Real world applications that deal with information extraction, such as business intelligence software or sensor data management, must often process data provided with varying degrees of uncertainty. Uncertainty can result from multiple or inconsistent sources, as well as approximate schema mappings. Modeling, managing and integrating uncertain data from multiple sources has been an active area of research in recent years. In particular, data integration systems free the user from the tedious tasks of finding relevant data sources, interacting with each source in isolation using its corresponding interface and combining data from multiple sources by providing a uniform query interface to gain access to the integrated information.

Previous work has integrated uncertain data using representation models such as the possible worlds and probabilistic relations. We extend this work by determining the probabilities of possible worlds of an extended probabilistic relation. We also present an algorithm to determine when a given extended probabilistic relation can be obtained by the integration of two probabilistic relations and give the decomposed pairs of probabilistic relations.
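As a small illustration of the possible-worlds representation mentioned above, the following Python sketch enumerates the worlds of a tiny probabilistic relation and their probabilities under the simplifying assumption of tuple independence; this is not necessarily the extended model used in the thesis, and the example tuples are invented.

```python
# Possible worlds of a probabilistic relation, assuming independent tuples:
# every subset of tuples is a world whose probability is the product of the
# chosen tuples' probabilities and the complements of the omitted ones.
from itertools import combinations

def possible_worlds(prob_relation):
    """prob_relation: list of (tuple, probability) pairs."""
    worlds = []
    for r in range(len(prob_relation) + 1):
        for subset in combinations(prob_relation, r):
            chosen = {t for t, _ in subset}
            p = 1.0
            for t, prob in prob_relation:
                p *= prob if t in chosen else (1 - prob)
            worlds.append((chosen, p))
    return worlds

rel = [(("alice", "NYC"), 0.8), (("alice", "LA"), 0.3)]
for world, p in possible_worlds(rel):
    print(sorted(world), round(p, 3))   # the four world probabilities sum to 1
```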

APA, Harvard, Vancouver, ISO, and other styles
18

CALABRIA, ANDREA. "Data integration for clinical genomics." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2011. http://hdl.handle.net/10281/19219.

Full text
Abstract:
Genetics and Molecular Biology are key to understanding the mechanisms of many human diseases that have strongly harmful effects. The empirical mission of Genetics is to translate these mechanisms into clinical benefits, thus bridging in-silico findings to the patient's bedside: approaching this goal means achieving what is commonly referred to as clinical genomics or personalized medicine. In this process, technologies are assuming an increasing role. With the introduction of new experimental platforms (microarrays, sequencing, etc.), today's analyses are much more detailed and can cover a wide spectrum of applications, from gene expression to Copy Number Variant detection. The advantages of technological improvements are usually accompanied by data management drawbacks due to the explosion of data throughput, which translates into a real need for new systems for data rationalization and management, data access, query and extraction. Our partner genetics laboratories encountered all of these issues: what they need is a tool that allows data integration and supports biological data analysis by exploiting computational infrastructures in a distributed environment. From such needs, we defined two main goals: (1) Computer Science goal: to design and implement a framework that integrates and manages data and genetic analyses; (2) Genetics and Molecular Biology goal (application domains): to solve biological problems through the framework and develop new methods. Given these requirements and related specifications, we designed an extensible framework based on three inter-connected layers: (1) the experimental data layer, which provides integration of data from high-throughput platforms (also called horizontal data integration); (2) the knowledge data layer, which provides integration of knowledge data (also called vertical integration); (3) the computational layer, which provides access to distributed environments for data analysis, in our case GRID and cluster technologies. On top of these three design blocks, single biological problems can be supported and custom user interfaces implemented. From our partner laboratories, two relevant biological problems were addressed: (1) Linkage analysis: given a large pedigree in which subjects were genotyped with chips of 1 million SNPs, the linkage analysis problem presented real computational limits. We designed a heuristic method to overcome the computational restrictions and implemented it within our framework, exploiting GRID and cluster environments. Using our approach, we obtained genetic results that were successfully validated by end-users. We also tested the performance of the system, reporting compared results. (2) SNP selection and ranking: given the problem of ranking SNPs based on a priori information, we developed a novel method for biological data mining on gene annotations. The method has been implemented as a web tool, SNP Ranker, which is undergoing thorough validation by our partner laboratories. The framework designed and implemented here demonstrated that this approach is consistent and can have a potential impact on the scientific community.
APA, Harvard, Vancouver, ISO, and other styles
19

Al-Mutairy, Badr. "Data mining and integration of heterogeneous bioinformatics data sources." Thesis, Cardiff University, 2008. http://orca.cf.ac.uk/54178/.

Full text
Abstract:
In this thesis, we present a novel approach to interoperability based on the use of biological relationships, applying relationship-based integration to link bioinformatics data sources; this refers to the use of different relationship types with different relationship closeness values to link gene expression datasets with other information available in public bioinformatics data sources. These relationships provide flexible linkage for biologists to discover linked data across the biological universe. Relationship closeness is a variable used to measure the closeness of the biological entities in a relationship and is a characteristic of the relationship. The novelty of this approach is that it allows a user to link a gene expression dataset with heterogeneous data sources dynamically and flexibly to facilitate comparative genomics investigations. Our research has demonstrated that using different relationships allows biologists to analyze experimental datasets in different ways, shortens the time needed to analyze the datasets and provides an easier way to undertake this analysis. Thus, it gives biologists more power to experiment with changing threshold values and linkage types. This is achieved in our framework by introducing the Soft Link Model (SLM) and a Relationship Knowledge Base (RKB), which is built and used by the SLM. The Integration and Data Mining Bioinformatics Data Sources system (IDMBD) is implemented as a proof-of-concept prototype to demonstrate the linkage technique described in the thesis.
APA, Harvard, Vancouver, ISO, and other styles
20

Fan, Hao. "Investigating a heterogeneous data integration approach for data warehousing." Thesis, Birkbeck (University of London), 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.424299.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Serra, Angela. "Multi-view learning and data integration for omics data." Doctoral thesis, Universita degli studi di Salerno, 2017. http://hdl.handle.net/10556/2580.

Full text
Abstract:
2015 - 2016
In recent years, the advancement of high-throughput technologies, combined with the constant decrease of the data-storage costs, has led to the production of large amounts of data from different experiments that characterise the same entities of interest. This information may relate to specific aspects of a phenotypic entity (e.g. Gene expression), or can include the comprehensive and parallel measurement of multiple molecular events (e.g., DNA modifications, RNA transcription and protein translation) in the same samples. Exploiting such complex and rich data is needed in the frame of systems biology for building global models able to explain complex phenotypes. For example, theuseofgenome-widedataincancerresearch, fortheidentificationof groups of patients with similar molecular characteristics, has become a standard approach for applications in therapy-response, prognosis-prediction, and drugdevelopment.ÂăMoreover, the integration of gene expression data regarding cell treatment by drugs, and information regarding chemical structure of the drugs allowed scientist to perform more accurate drug repositioning tasks. Unfortunately, there is a big gap between the amount of information and the knowledge in which it is translated. Moreover, there is a huge need of computational methods able to integrate and analyse data to fill this gap. Current researches in this area are following two different integrative methods: one uses the complementary information of different measurements for the 7 i i “Template” — 2017/6/9 — 16:42 — page 8 — #8 i i i i i i study of complex phenotypes on the same samples (multi-view learning); the other tends to infer knowledge about the phenotype of interest by integrating and comparing the experiments relating to it with respect to those of different phenotypes already known through comparative methods (meta-analysis). Meta-analysis can be thought as an integrative study of previous results, usually performed aggregating the summary statistics from different studies. Due to its nature, meta-analysis usually involves homogeneous data. On the other hand, multi-view learning is a more flexible approach that considers the fusion of different data sources to get more stable and reliable estimates. Based on the type of data and the stage of integration, new methodologies have been developed spanning a landscape of techniques comprising graph theory, machine learning and statistics. Depending on the nature of the data and on the statistical problem to address, the integration of heterogeneous data can be performed at different levels: early, intermediate and late. Early integration consists in concatenating data from different views in a single feature space. Intermediate integration consists in transforming all the data sources in a common feature space before combining them. In the late integration methodologies, each view is analysed separately and the results are then combined. The purpose of this thesis is twofold: the former objective is the definition of a data integration methodology for patient sub-typing (MVDA) and the latter is the development of a tool for phenotypic characterisation of nanomaterials (INSIdEnano). In this PhD thesis, I present the methodologies and the results of my research. MVDA is a multi-view methodology that aims to discover new statistically relevant patient sub-classes. Identify patient subtypes of a specific diseases is a challenging task especially in the early diagnosis. 
This is a crucial point for the treatment, because not allthe patients affected bythe same diseasewill have the same prognosis or need the same drug treatment. This problem is usually solved by using transcriptomic data to identify groups of patients that share the same gene patterns. The main idea underlying this research work is that to combine more omics data for the same patients to obtain a better characterisation of their disease profile. The proposed methodology is a late integration approach i i “Template” — 2017/6/9 — 16:42 — page 9 — #9 i i i i i i based on clustering. It works by evaluating the patient clusters in each single view and then combining the clustering results of all the views by factorising the membership matrices in a late integration manner. The effectiveness and the performance of our method was evaluated on six multi-view cancer datasets related to breast cancer, glioblastoma, prostate and ovarian cancer. The omics data used for the experiment are gene and miRNA expression, RNASeq and miRNASeq, Protein Expression and Copy Number Variation. In all the cases, patient sub-classes with statistical significance were found, identifying novel sub-groups previously not emphasised in literature. The experiments were also conducted by using prior information, as a new view in the integration process, to obtain higher accuracy in patients’ classification. The method outperformed the single view clustering on all the datasets; moreover, it performs better when compared with other multi-view clustering algorithms and, unlike other existing methods, it can quantify the contribution of single views in the results. The method has also shown to be stable when perturbation is applied to the datasets by removing one patient at a time and evaluating the normalized mutual information between all the resulting clusterings. These observations suggest that integration of prior information with genomic features in sub-typing analysis is an effective strategy in identifying disease subgroups. INSIdE nano (Integrated Network of Systems bIology Effects of nanomaterials) is a novel tool for the systematic contextualisation of the effects of engineered nanomaterials (ENMs) in the biomedical context. In the recent years, omics technologies have been increasingly used to thoroughly characterise the ENMs molecular mode of action. It is possible to contextualise the molecular effects of different types of perturbations by comparing their patterns of alterations. While this approach has been successfully used for drug repositioning, it is still missing to date a comprehensive contextualisation of the ENM mode of action. The idea behind the tool is to use analytical strategies to contextualise or position the ENM with the respect to relevant phenotypes that have been studied in literature, (such as diseases, drug treatments, and other chemical exposures) by comparing their patterns of molecular alteration. This could greatly increase the knowledge on the ENM molecular effects and in turn i i “Template” — 2017/6/9 — 16:42 — page 10 — #10 i i i i i i contribute to the definition of relevant pathways of toxicity as well as help in predicting the potential involvement of ENM in pathogenetic events or in novel therapeutic strategies. The main hypothesis is that suggestive patterns of similarity between sets of phenotypes could be an indication of a biological association to be further tested in toxicological or therapeutic frames. 
Based on the expression signature associated with each phenotype, the strength of similarity between each pair of perturbations was evaluated and used to build a large network of phenotypes. To ensure the usability of INSIdE nano, a robust and scalable computational infrastructure was developed to scan this large phenotypic network, and an effective web-based graphical user interface was built. In particular, INSIdE nano was scanned to search for clique sub-networks: quadruplet structures of heterogeneous nodes (a disease, a drug, a chemical and a nanomaterial) completely interconnected by strong patterns of similarity (or anti-similarity). The predictions were evaluated against a set of known associations between diseases and drugs, based on drug indications in clinical practice, and between diseases and chemicals, based on literature-based causal exposure evidence, focusing on the possible involvement of nanomaterials in the most robust cliques. The evaluation of INSIdE nano confirmed that it highlights known disease-drug and disease-chemical connections. Moreover, disease similarities agree with information based on their clinical features, and similarities between drugs and chemicals mirror their resemblance in chemical structure. Altogether, the results suggest that INSIdE nano can also be successfully used to contextualise the molecular effects of ENMs and infer their connections to other better-studied phenotypes, speeding up their safety assessment as well as opening new perspectives concerning their usefulness in biomedicine. [edited by author]
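As a rough, hypothetical illustration of the clique search described above (the real tool operates on a large network of expression-signature similarities), the sketch below builds a toy typed similarity graph and lists the fully connected quadruplets containing one disease, one drug, one chemical and one nanomaterial; all node names, similarity scores and the threshold are invented for the example.

```python
# Toy illustration of searching a phenotype-similarity network for
# heterogeneous 4-cliques (one disease, one drug, one chemical, one ENM).
import networkx as nx

G = nx.Graph()
nodes = {
    "asthma": "disease", "dexamethasone": "drug",
    "benzene": "chemical", "TiO2_nano": "nanomaterial",
    "fibrosis": "disease", "bleomycin": "drug",
}
for name, kind in nodes.items():
    G.add_node(name, kind=kind)

# Edges carry a (made-up) similarity score between expression signatures.
edges = [
    ("asthma", "dexamethasone", 0.81), ("asthma", "benzene", 0.74),
    ("asthma", "TiO2_nano", 0.69), ("dexamethasone", "benzene", 0.77),
    ("dexamethasone", "TiO2_nano", 0.72), ("benzene", "TiO2_nano", 0.80),
    ("fibrosis", "bleomycin", 0.90),
]
for u, v, s in edges:
    if s >= 0.65:                      # keep only strong similarities
        G.add_edge(u, v, similarity=s)

wanted = {"disease", "drug", "chemical", "nanomaterial"}
for clique in nx.enumerate_all_cliques(G):
    if len(clique) == 4:
        kinds = {G.nodes[n]["kind"] for n in clique}
        if kinds == wanted:            # one node of each type
            print(sorted(clique))
```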
L’avanzamento tecnologico delle tecnologie high-throughput, combinato con il costante decremento dei costi di memorizzazione, ha portato alla produzione di grandi quantità di dati provenienti da diversi esperimenti che caratterizzano le stesse entità di interesse. Queste informazioni possono essere relative a specifici aspetti fenotipici (per esempio l’espressione genica), o possono includere misure globali e parallele di diversi aspetti molecolari (per esempio modifiche del DNA, trascrizione dell’RNA e traduzione delle proteine) negli stessi campioni. Analizzare tali dati complessi è utile nel campo della systems biology per costruire modelli capaci di spiegare fenotipi complessi. Ad esempio, l’uso di dati genome-wide nella ricerca legata al cancro, per l’identificazione di gruppi di pazienti con caratteristiche molecolari simili, è diventato un approccio standard per una prognosi precoce più accurata e per l’identificazione di terapie specifiche. Inoltre, l’integrazione di dati di espressione genica riguardanti il trattamento di cellule tramite farmaci ha permesso agli scienziati di ottenere accuratezze elevate per il drug repositioning. Purtroppo, esiste un grosso divario tra i dati prodotti, in seguito ai numerosi esperimenti, e l’informazione in cui essi sono tradotti. Quindi la comunità scientifica ha una forte necessità di metodi computazionali per poter integrare e analizzare tali dati per riempire questo divario. La ricerca nel campo delle analisi multi-view segue due diversi metodi di analisi integrative: uno usa le informazioni complementari di diverse misure per studiare fenotipi complessi su diversi campioni (multi-view learning); l’altro tende ad inferire conoscenza sul fenotipo di interesse di una entità confrontando gli esperimenti ad essa relativi con quelli di altre entità fenotipiche già note in letteratura (meta-analisi). La meta-analisi può essere pensata come uno studio comparativo dei risultati identificati in un particolare esperimento, rispetto a quelli di studi precedenti. A causa della sua natura, la meta-analisi solitamente coinvolge dati omogenei. D’altra parte, il multi-view learning è un approccio più flessibile che considera la fusione di diverse sorgenti di dati per ottenere stime più stabili e affidabili. In base al tipo di dati e al livello di integrazione, nuove metodologie sono state sviluppate a partire da tecniche basate sulla teoria dei grafi, machine learning e statistica. In base alla natura dei dati e al problema statistico da risolvere, l’integrazione di dati eterogenei può essere effettuata a diversi livelli: early, intermediate e late integration. Le tecniche di early integration consistono nella concatenazione dei dati delle diverse viste in un unico spazio delle feature. Le tecniche di intermediate integration consistono nella trasformazione di tutte le sorgenti dati in un unico spazio comune prima di combinarle. Nelle tecniche di late integration, ogni vista è analizzata separatamente e i risultati sono poi combinati. Lo scopo di questa tesi è duplice: il primo obiettivo è la definizione di una metodologia di integrazione dati per la sotto-tipizzazione dei pazienti (MVDA) e il secondo è lo sviluppo di un tool per la caratterizzazione fenotipica dei nanomateriali (INSIdEnano). In questa tesi di dottorato presento le metodologie e i risultati della mia ricerca. MVDA è una tecnica multi-view con lo scopo di scoprire nuove sotto-tipologie di pazienti statisticamente rilevanti.
Identificare sottotipi di pazienti per una malattia specifica è un obiettivo di alto rilievo nella pratica clinica, soprattutto per la diagnosi precoce delle malattie. Questo problema è generalmente risolto usando dati di trascrittomica per identificare i gruppi di pazienti che condividono gli stessi pattern di alterazione genica. L’idea principale alla base di questo lavoro di ricerca è quella di combinare più tipologie di dati omici per gli stessi pazienti per ottenere una migliore caratterizzazione del loro profilo. La metodologia proposta è un approccio di tipo late integration basato sul clustering. Per ogni vista viene effettuato il clustering dei pazienti, rappresentato sotto forma di matrici di membership. I risultati di tutte le viste vengono poi combinati tramite una tecnica di fattorizzazione di matrici per ottenere i metacluster finali multi-view. La fattibilità e le performance del nostro metodo sono state valutate su sei dataset multi-view relativi al tumore al seno, glioblastoma, cancro alla prostata e alle ovaie. I dati omici usati per gli esperimenti sono relativi all’espressione dei geni, espressione dei miRNA, RNA-Seq, miRNA-Seq, espressione delle proteine e Copy Number Variation. In tutti i dataset sono state identificate sotto-tipologie di pazienti con rilevanza statistica, identificando nuovi sottogruppi precedentemente non noti in letteratura. Ulteriori esperimenti sono stati condotti utilizzando la conoscenza a priori relativa alle macro classi dei pazienti. Tale informazione è stata considerata come una ulteriore vista nel processo di integrazione per ottenere una accuratezza più elevata nella classificazione dei pazienti. Il metodo proposto ha performance migliori degli algoritmi di clustering classici su tutti i dataset. MVDA ha ottenuto risultati migliori in confronto ad altri algoritmi di integrazione di tipo early e intermediate integration. Inoltre il metodo è in grado di calcolare il contributo di ogni singola vista al risultato finale. I risultati mostrano, anche, che il metodo è stabile in caso di perturbazioni del dataset effettuate rimuovendo un paziente alla volta (leave-one-out). Queste osservazioni suggeriscono che l’integrazione di informazioni a priori e feature genomiche, da utilizzare congiuntamente durante l’analisi, è una strategia vincente nell’identificazione di sotto-tipologie di malattie. INSIdE nano (Integrated Network of Systems bIology Effects of nanomaterials) è un tool innovativo per la contestualizzazione sistematica degli effetti delle nanoparticelle (ENMs) in contesti biomedici. Negli ultimi anni, le tecnologie omiche sono state ampiamente applicate per caratterizzare i nanomateriali a livello molecolare. È possibile contestualizzare l’effetto a livello molecolare di diversi tipi di perturbazioni confrontando i loro pattern di alterazione genica. Mentre tale approccio è stato applicato con successo nel campo del drug repositioning, una contestualizzazione estensiva dell’effetto dei nanomateriali sulle cellule è attualmente mancante. L’idea alla base del tool è quella di usare strategie comparative di analisi per contestualizzare o posizionare i nanomateriali in confronto a fenotipi rilevanti che sono stati studiati in letteratura (come ad esempio malattie dell’uomo, trattamenti farmacologici o esposizioni a sostanze chimiche) confrontando i loro pattern di alterazione molecolare.
Questo potrebbe incrementare la conoscenza dell’effetto molecolare dei nanomateriali e contribuire alla definizione di nuovi pathway tossicologici oppure identificare eventuali coinvolgimenti dei nanomateriali in eventi patologici o in nuove strategie terapeutiche. L’ipotesi alla base è che l’identificazione di pattern di similarità tra insiemi di fenotipi potrebbe essere una indicazione di una associazione biologica che deve essere successivamente testata in ambito tossicologico o terapeutico. Basandosi sulla firma di espressione genica, associata ad ogni fenotipo, la similarità tra ogni coppia di perturbazioni è stata valutata e usata per costruire una grande network di interazione tra fenotipi. Per assicurare l’utilizzo di INSIdE nano, è stata sviluppata una infrastruttura computazionale robusta e scalabile, allo scopo di analizzare tale network. Inoltre è stato realizzato un sito web che permettesse agli utenti di interrogare e visualizzare la network in modo semplice ed efficiente. In particolare, INSIdE nano è stato analizzato cercando tutte le possibili clique di quattro elementi eterogenei (un nanomateriale, un farmaco, una malattia e una sostanza chimica). Una clique è una sotto-network completamente connessa, dove ogni elemento è collegato con tutti gli altri. Di tutte le clique, sono state considerate come significative solo quelle per le quali le associazioni tra farmaco e malattia e tra malattia e sostanze chimiche sono note. Le connessioni note tra farmaci e malattie si basano sul fatto che il farmaco è prescritto per curare tale malattia. Le connessioni note tra malattia e sostanze chimiche si basano su evidenze presenti in letteratura del fatto che tali sostanze causano la malattia. Il focus è stato posto sul possibile coinvolgimento dei nanomateriali con le malattie presenti in tali clique. La valutazione di INSIdE nano ha confermato che esso mette in evidenza connessioni note tra malattie e farmaci e tra malattie e sostanze chimiche. Inoltre la similarità tra le malattie calcolata in base ai geni è conforme alle informazioni basate sulle loro caratteristiche cliniche. Allo stesso modo le similarità tra farmaci e sostanze chimiche rispecchiano le loro similarità basate sulla struttura chimica. Nell’insieme, i risultati suggeriscono che INSIdE nano può essere usato per contestualizzare l’effetto molecolare dei nanomateriali e inferirne le connessioni rispetto a fenotipi precedentemente studiati in letteratura. Questo metodo permette di velocizzare il processo di valutazione della loro tossicità e apre nuove prospettive per il loro utilizzo nella biomedicina. [a cura dell'autore]
APA, Harvard, Vancouver, ISO, and other styles
22

Sutiwaraphun, Janjao. "Knowledge integration in distributed data mining." Thesis, Imperial College London, 2001. http://hdl.handle.net/10044/1/8924.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Williams, Dean Ashley. "Combining data integration and information extraction." Thesis, Birkbeck (University of London), 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.499152.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Morris, Christopher Robert. "Data integration in the rail domain." Thesis, University of Birmingham, 2018. http://etheses.bham.ac.uk//id/eprint/8204/.

Full text
Abstract:
The exchange of information is crucial to the operation of railways; starting with the distribution of timetables, information must constantly be exchanged in any railway network. The slow evolution of the information environment within the rail industry has resulted in a diverse range of systems, only able to exchange the information essential to railway operations. Were the cost of data integration reduced, further cost reductions and improvements to customer service would follow as barriers to the adoption of other technologies are removed. The need for data integration has already been studied extensively and has been included in the UK industry's rail technical strategy; however, despite its identification as a key technique for improving integration, uptake of ontology remains limited. This thesis considers techniques to reduce barriers to the take-up of ontology in the UK rail industry, and presents a case study in which these techniques are applied. Among the key barriers to uptake identified are a lack of software engineers with ontology experience and the diverse information environment within the rail domain. Techniques to overcome these barriers using software-based tools are considered, and example tools are produced which help overcome them. The case study presented is of a degraded-mode signalling system, drawing data from a range of diverse sources, integrated using an ontology. Tools created to improve data integration are employed in this commercial project, successfully combining signalling data with (simulated) train positioning data.
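For readers unfamiliar with ontology-based integration, the following sketch shows the general idea with rdflib: two small data sources described with the same vocabulary are merged into one graph and queried together. The mini-vocabulary, resource names and query are invented for the example and are not taken from the thesis or from any rail-industry ontology.

```python
# Minimal illustration of ontology-backed data integration with rdflib:
# two sources expressed against a shared (invented) vocabulary are merged
# into a single graph and queried with SPARQL.
from rdflib import Graph, Namespace, Literal

RAIL = Namespace("http://example.org/rail#")

signalling = Graph()
signalling.add((RAIL["signal42"], RAIL.protectsSection, RAIL["sectionA"]))

positioning = Graph()
positioning.add((RAIL["train7"], RAIL.occupiesSection, RAIL["sectionA"]))
positioning.add((RAIL["train7"], RAIL.speedKmh, Literal(45)))

# The integration step: union of the two source graphs.
merged = Graph()
for source in (signalling, positioning):
    for triple in source:
        merged.add(triple)

query = """
PREFIX rail: <http://example.org/rail#>
SELECT ?signal ?train WHERE {
    ?signal rail:protectsSection ?section .
    ?train  rail:occupiesSection ?section .
}
"""
for signal, train in merged.query(query):
    print(signal, train)
```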
APA, Harvard, Vancouver, ISO, and other styles
25

Silva, Luís António Bastião. "Federated architecture for biomedical data integration." Doctoral thesis, Universidade de Aveiro, 2015. http://hdl.handle.net/10773/15759.

Full text
Abstract:
Doutoramento Ciências da Computação
The last decades have been characterised by a continuous adoption of IT solutions in the healthcare sector, which has resulted in the proliferation of tremendous amounts of data over heterogeneous systems. Distinct data types are currently generated, manipulated, and stored in the several institutions where patients are treated. Data sharing and integrated access to this information will allow the extraction of relevant knowledge that can lead to better diagnostics and treatments. This thesis proposes new integration models for gathering information and extracting knowledge from multiple and heterogeneous biomedical sources. The complexity of the scenario led us to split the integration problem according to the data type and to the usage specificity. The first contribution is a cloud-based architecture for exchanging medical imaging services. It offers a simplified registration mechanism for providers and services, promotes remote data access, and facilitates the integration of distributed data sources. Moreover, it is compliant with international standards, ensuring the platform's interoperability with current medical imaging devices. The second proposal is a sensor-based architecture for the integration of electronic health records. It follows a federated integration model and aims to provide a scalable solution to search and retrieve data from multiple information systems. The last contribution is an open architecture for gathering patient-level data from dispersed and heterogeneous databases. All the proposed solutions were deployed and validated in real-world use cases.
A adoção sucessiva das tecnologias de comunicação e de informação na área da saúde tem permitido um aumento na diversidade e na qualidade dos serviços prestados, mas, ao mesmo tempo, tem gerado uma enorme quantidade de dados, cujo valor científico está ainda por explorar. A partilha e o acesso integrado a esta informação poderá permitir a identificação de novas descobertas que possam conduzir a melhores diagnósticos e a melhores tratamentos clínicos. Esta tese propõe novos modelos de integração e de exploração de dados com vista à extração de conhecimento biomédico a partir de múltiplas fontes de dados. A primeira contribuição é uma arquitetura baseada em nuvem para partilha de serviços de imagem médica. Esta solução oferece um mecanismo de registo simplificado para fornecedores e serviços, permitindo o acesso remoto e facilitando a integração de diferentes fontes de dados. A segunda proposta é uma arquitetura baseada em sensores para integração de registos electrónicos de pacientes. Esta estratégia segue um modelo de integração federado e tem como objetivo fornecer uma solução escalável que permita a pesquisa em múltiplos sistemas de informação. Finalmente, o terceiro contributo é um sistema aberto para disponibilizar dados de pacientes num contexto europeu. Todas as soluções foram implementadas e validadas em cenários reais.
APA, Harvard, Vancouver, ISO, and other styles
26

Sernadela, Pedro Miguel Lopes. "Data integration services for biomedical applications." Doctoral thesis, Universidade de Aveiro, 2018. http://hdl.handle.net/10773/23511.

Full text
Abstract:
Doutoramento em Informática (MAP-i)
In the last decades, the field of biomedical science has fostered unprecedented scientific advances. Research is stimulated by the constant evolution of information technology, delivering novel and diverse bioinformatics tools. Nevertheless, the proliferation of new and disconnected solutions has resulted in massive amounts of resources spread over heterogeneous and distributed platforms. Distinct data types and formats are generated and stored in miscellaneous repositories, posing data interoperability challenges and causing delays in discoveries. Data sharing and integrated access to these resources are key features for successful knowledge extraction. In this context, this thesis makes contributions towards accelerating the semantic integration, linkage and reuse of biomedical resources. The first contribution addresses the connection of distributed and heterogeneous registries. The proposed methodology creates a holistic view over the different registries, supporting semantic data representation, integrated access and querying. The second contribution addresses the integration of heterogeneous information across scientific research, aiming to enable adequate data-sharing services. The third contribution presents a modular architecture to support the extraction and integration of textual information, enabling the full exploitation of curated data. The last contribution lies in providing a platform to accelerate the deployment of enhanced semantic information systems. All the proposed solutions were deployed and validated in the scope of rare diseases.
Nas últimas décadas, o campo das ciências biomédicas proporcionou grandes avanços científicos estimulados pela constante evolução das tecnologias de informação. A criação de diversas ferramentas na área da bioinformática e a falta de integração entre novas soluções resultou em enormes quantidades de dados distribuídos por diferentes plataformas. Dados de diferentes tipos e formatos são gerados e armazenados em vários repositórios, o que origina problemas de interoperabilidade e atrasa a investigação. A partilha de informação e o acesso integrado a esses recursos são características fundamentais para a extração bem sucedida do conhecimento científico. Nesta medida, esta tese fornece contribuições para acelerar a integração, ligação e reutilização semântica de dados biomédicos. A primeira contribuição aborda a interconexão de registos distribuídos e heterogéneos. A metodologia proposta cria uma visão holística sobre os diferentes registos, suportando a representação semântica de dados e o acesso integrado. A segunda contribuição aborda a integração de diversos dados para investigações científicas, com o objetivo de suportar serviços interoperáveis para a partilha de informação. O terceiro contributo apresenta uma arquitetura modular que apoia a extração e integração de informações textuais, permitindo a exploração destes dados. A última contribuição consiste numa plataforma web para acelerar a criação de sistemas de informação semânticos. Todas as soluções propostas foram validadas no âmbito das doenças raras.
APA, Harvard, Vancouver, ISO, and other styles
27

Mishra, Alok. "Data integration for regulatory module discovery." Thesis, Imperial College London, 2012. http://hdl.handle.net/10044/1/10615.

Full text
Abstract:
Genomic data relating to the functioning of individual genes and their products are rapidly being produced using many different and diverse experimental techniques. Each piece of data provides information on a specific aspect of the cell regulation process. Integration of these diverse types of data is essential in order to identify biologically relevant regulatory modules. In this thesis, we address this challenge by analysing the nature of these datasets and proposing new techniques for data integration. Since microarray data are not available in the quantities required for valid inference, many researchers have taken a blind integrative approach in which data from diverse microarray experiments are merged. In order to understand the validity of this approach, we start this thesis by studying the heterogeneity of microarray datasets. We have used the KL divergence between individual dataset distributions, as well as an empirical technique that we propose, to calculate functional similarity between the datasets. Our results indicate that blind integration of datasets should be avoided, and much care should be taken to ensure that only similar types of data are mixed. The choice of normalisation method also matters. Next, we propose a semi-supervised spectral clustering method which integrates two diverse types of data for the task of gene regulatory module discovery. The technique uses constraints derived from DNA-binding, PPI and TF-gene interaction datasets to guide the spectral clustering of microarray experiments. Our results on yeast stress and cell-cycle microarray data indicate that the integration leads to more biologically significant results. Finally, we propose a technique that integrates datasets under the principle of maximum entropy. We argue that this is the most valid approach in an unsupervised setting where we have no other evidence regarding the weights to be assigned to individual datasets. Our experiments with yeast microarray, PPI, DNA-binding and TF-gene interaction datasets show improved biological significance of results.
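As a small, self-contained sketch of the kind of heterogeneity check mentioned above, the code below estimates a symmetrised Kullback-Leibler divergence between the empirical expression distributions of two synthetic datasets; the data, the bin count and the smoothing constant are arbitrary choices for the example, not the thesis's actual settings.

```python
# Sketch: symmetrised KL divergence between the empirical expression
# distributions of two (synthetic) microarray datasets.
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(1)
dataset_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2000)).ravel()
dataset_b = rng.normal(loc=0.3, scale=1.2, size=(40, 2000)).ravel()

# Common binning so the two histograms are directly comparable.
bins = np.histogram_bin_edges(np.concatenate([dataset_a, dataset_b]), bins=100)
p, _ = np.histogram(dataset_a, bins=bins, density=True)
q, _ = np.histogram(dataset_b, bins=bins, density=True)

eps = 1e-12                          # avoid log(0) on empty bins
p, q = p + eps, q + eps
p, q = p / p.sum(), q / q.sum()

sym_kl = 0.5 * (entropy(p, q) + entropy(q, p))
print(f"symmetrised KL divergence: {sym_kl:.4f}")
```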
APA, Harvard, Vancouver, ISO, and other styles
28

Adai, Alex Tamas. "Uncovering microRNA function through data integration." Diss., Search in ProQuest Dissertations & Theses. UC Only, 2008. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3311333.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Friedman, Marc T. "Representation and optimization for data integration /." Thesis, Connect to this title online; UW restricted, 1999. http://hdl.handle.net/1773/6979.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Ives, Zachary G. "Efficient query processing for data integration /." Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6864.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Ayyad, Majed. "Real-Time Event Centric Data Integration." Doctoral thesis, Università degli studi di Trento, 2014. https://hdl.handle.net/11572/367750.

Full text
Abstract:
A vital step in integrating data from multiple sources is detecting and handling duplicate records that refer to the same real-life entity. Events are spatio-temporal entities that reflect changes in the real world and are received or captured from different sources (sensors, mobile phones, social network services, etc.). In many real-world situations, detecting events mostly takes place through multiple observations by different observers. The local view of the observer reflects only partial knowledge with a certain granularity of time and space. Observations occur at a particular place and time; events, however, which are inferred from observations, range over time and space. In this thesis, we address the problem of event matching, which is the task of detecting similar events in the recent past from their observations. We focus on detecting hyperlocal events, which are an integral part of any dynamic human decision-making process and are useful for different multi-tier responding agencies such as emergency medical services, public safety and law enforcement agencies, and organisations working on fusing news from different sources, as well as for citizens. In an environment where continuous monitoring and processing is required, the matching task imposes different challenges. In particular, the matching task is decomposed into four separate tasks, each requiring a different computational method. The four tasks are: event-type similarity, similarity in location, similarity in time, and thematic-role similarity, which handles participant similarity. We refer to these four tasks as local similarities. A global similarity measure then combines the four local similarities before the events can be clustered and handled in a robust near-real-time system. We address the local similarities by thoroughly studying existing similarity measures and proposing a suitable similarity for each task. We utilise ideas from the semantic web, qualitative spatial reasoning, fuzzy sets and structural alignment similarities in order to define the local similarity measures. We then address the global similarity by treating the problem as a relational learning problem and using machine learning to learn the weights of each local similarity. To learn the weights, we combine the features of each pair of events into one object and use logistic regression and support vector machines. The learned weighted function is tested and evaluated on a real dataset and used to predict the similarity class of each newly streamed event.
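A minimal sketch of the learning step described above, assuming toy data: the four local similarities of a pair of event observations are combined into a global match probability with scikit-learn's logistic regression. Feature values and labels below are invented for illustration, not taken from the thesis's dataset.

```python
# Sketch: learn a global event-similarity score from four local similarities
# (event type, location, time, participants) with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [type_sim, location_sim, time_sim, participant_sim] for a pair
# of event observations; label 1 means the pair refers to the same event.
X = np.array([
    [0.9, 0.8, 0.95, 0.7],
    [0.2, 0.1, 0.30, 0.2],
    [0.8, 0.9, 0.85, 0.6],
    [0.1, 0.3, 0.20, 0.1],
    [0.7, 0.6, 0.90, 0.8],
    [0.3, 0.2, 0.10, 0.3],
])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print("learned weights:", model.coef_[0])

# Global similarity of a new candidate pair = probability of being a match.
candidate = np.array([[0.85, 0.7, 0.9, 0.65]])
print("match probability:", model.predict_proba(candidate)[0, 1])
```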
APA, Harvard, Vancouver, ISO, and other styles
32

Ayyad, Majed. "Real-Time Event Centric Data Integration." Doctoral thesis, University of Trento, 2014. http://eprints-phd.biblio.unitn.it/1353/1/REAL-TIME_EVENT_CENTRIC_DATA_INTEGRATION.pdf.

Full text
Abstract:
A vital step in integrating data from multiple sources is detecting and handling duplicate records that refer to the same real-life entity. Events are spatio-temporal entities that reflect changes in the real world and are received or captured from different sources (sensors, mobile phones, social network services, etc.). In many real-world situations, detecting events mostly takes place through multiple observations by different observers. The local view of the observer reflects only partial knowledge with a certain granularity of time and space. Observations occur at a particular place and time; events, however, which are inferred from observations, range over time and space. In this thesis, we address the problem of event matching, which is the task of detecting similar events in the recent past from their observations. We focus on detecting hyperlocal events, which are an integral part of any dynamic human decision-making process and are useful for different multi-tier responding agencies such as emergency medical services, public safety and law enforcement agencies, and organisations working on fusing news from different sources, as well as for citizens. In an environment where continuous monitoring and processing is required, the matching task imposes different challenges. In particular, the matching task is decomposed into four separate tasks, each requiring a different computational method. The four tasks are: event-type similarity, similarity in location, similarity in time, and thematic-role similarity, which handles participant similarity. We refer to these four tasks as local similarities. A global similarity measure then combines the four local similarities before the events can be clustered and handled in a robust near-real-time system. We address the local similarities by thoroughly studying existing similarity measures and proposing a suitable similarity for each task. We utilise ideas from the semantic web, qualitative spatial reasoning, fuzzy sets and structural alignment similarities in order to define the local similarity measures. We then address the global similarity by treating the problem as a relational learning problem and using machine learning to learn the weights of each local similarity. To learn the weights, we combine the features of each pair of events into one object and use logistic regression and support vector machines. The learned weighted function is tested and evaluated on a real dataset and used to predict the similarity class of each newly streamed event.
APA, Harvard, Vancouver, ISO, and other styles
33

Tous, Ruben. "Data integration with XML and semantic web technologies novel approaches in the design of modern data integration systems." Saarbrücken VDM Verlag Dr. Müller, 2006. http://d-nb.info/991303105/04.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Sonmez, Sunercan Hatice Kevser. "Data Integration Over Horizontally Partitioned Databases In Service-oriented Data Grids." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12612414/index.pdf.

Full text
Abstract:
Information integration over distributed and heterogeneous resources is challenging in many respects: coping with various kinds of heterogeneity (data model, platform, access interfaces); coping with various forms of data distribution and maintenance policies; and addressing scalability, performance, security and trust, reliability and resilience, legal issues, etc. Each of these dimensions deserves a separate thread of research effort. The particular challenge most relevant to the work presented in this thesis is coping with various forms of data distribution and maintenance policies. This thesis aims to provide a service-oriented data integration solution over data Grids for cases where distributed data sources are partitioned with overlapping sections of various proportions. This is an interesting variation which combines both replicated and partitioned data within the same data management framework. Thus, the data management infrastructure has to deal with specific challenges regarding the identification, access and aggregation of partitioned data with varying proportions of overlapping sections. To provide a solution, we have extended OGSA-DAI DQP, a well-known service-oriented data access and integration middleware with distributed query processing facilities, by incorporating a UnionPartitions operator into its algebra in order to cope with various unusual forms of horizontally partitioned databases. As a result, our solution extends OGSA-DAI DQP in two points: (1) a new operator type is added to the algebra to perform a specialized union of partitions with different characteristics, and (2) the OGSA-DAI DQP Federation Description is extended to include additional metadata that facilitates the successful execution of the newly introduced operator.
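To give a flavour of the specialised union such an operator performs, here is a small pandas sketch that combines two overlapping horizontal partitions of one logical table and removes the rows duplicated across sites. It is only an illustration of the idea, not the OGSA-DAI DQP implementation, and the table and column names are invented.

```python
# Sketch: union of overlapping horizontal partitions of one logical table,
# deduplicating the rows stored at more than one site.
import pandas as pd

partition_site_a = pd.DataFrame(
    {"patient_id": [1, 2, 3, 4], "diagnosis": ["A", "B", "A", "C"]}
)
partition_site_b = pd.DataFrame(
    {"patient_id": [3, 4, 5, 6], "diagnosis": ["A", "C", "B", "A"]}
)

# Concatenate the partitions, then drop the overlapping section
# (patients 3 and 4 are stored at both sites).
combined = (
    pd.concat([partition_site_a, partition_site_b], ignore_index=True)
    .drop_duplicates(subset="patient_id")
    .reset_index(drop=True)
)
print(combined)
```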
APA, Harvard, Vancouver, ISO, and other styles
35

Wegener, Dennis [Verfasser]. "Integration of Data Mining into Scientific Data Analysis Processes / Dennis Wegener." Bonn : Universitäts- und Landesbibliothek Bonn, 2012. http://d-nb.info/1044867434/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Repchevskiy, Dmitry. "Ontology based data integration in life sciences." Doctoral thesis, Universitat de Barcelona, 2016. http://hdl.handle.net/10803/386411.

Full text
Abstract:
The aim of this thesis is to develop standard and practical approaches for the semantic integration of biological data and services. The thesis considers various scenarios where ontologies may benefit bioinformatics web service development, integration and provenance. In spite of the broad use of ontologies in biology, their usage is usually limited to the definition of taxonomic hierarchies. This thesis examines the utility of ontologies for data integration in the context of semantic web services development. Ontologies of biological datatypes are very valuable for data integration, especially in a context of continuously changing standards. The thesis evaluates the outdated BioMoby ontology for the generation of modern WS-I and RESTful web services. Another important aspect is the use of ontologies for web service description. The thesis evaluates the W3C standard WSDL ontology for bioinformatics web service description and provenance. Finally, integration with modern workflow execution platforms such as Taverna and Galaxy is also considered. Despite the growing popularity of the JSON format, web services still depend heavily on the XML type system. The OWL2XS tool facilitates semantic web service development by providing automatic XML Schema generation from an appropriate OWL 2 datatype ontology. Web service integration is hardly achievable without broad adoption of standards. The BioNemus application automatically generates standards-based web services from BioMoby ontologies. Semantic representation of web service descriptions simplifies web service search and annotation. The Semantic Web Services Registry (BioSWR) is based on the W3C WSDL ontology and provides a multifaceted view of web services in different formats: OWL 2, WSDL 1.1, WSDL 2.0 and WADL. To demonstrate the benefits of ontology-based web service descriptions, a BioSWR Taverna OSGI plug-in has been developed. The new, experimental Taverna WSDL generic library has been used in the Galaxy Gears tool, which allows the integration of web services into Galaxy workflows. The thesis explores the scope of ontology application for the integration of biological data and services, providing a broad set of original tools.
El objetivo de la tesis es el desarrollo de una solución práctica y estándar para la integración semántica de los datos y servicios biológicos. La tesis estudia escenarios diferentes en los cuales las ontologías pueden beneficiar el desarrollo de los servicios web, su búsqueda y su visibilidad. A pesar de que las ontologías son ampliamente utilizadas en la biología, su uso habitualmente se limita a la definición de las jerarquías taxonómicas. La tesis examina la utilidad de las ontologías para la integración de los datos en el desarrollo de los servicios web semánticos. Las ontologías que definen los tipos de datos biológicos tienen un gran valor para la integración de los datos, especialmente ante un cambio continuo de los estándares. La tesis evalúa la ontología BioMoby para la generación de los servicios web conforme con las especificaciones WS-I y los servicios REST. Otro aspecto muy importante de la tesis es el uso de las ontologías para la descripción de los servicios web. La tesis evalúa la ontología WSDL promovida por el consorcio W3C para la descripción de los servicios y su búsqueda. Finalmente, se considera la integración con las plataformas modernas de la ejecución de los flujos de trabajo como Taverna y Galaxy. A pesar de la creciente popularidad del formato JSON, los servicios web dependen mucho del XML. La herramienta OWL2XS facilita el desarrollo de los servicios web semánticos generando un esquema XML a partir de una ontología OWL 2. La integración de los servicios web es difícil de conseguir sin una adaptación de los estándares. La aplicación BioNemus genera de manera automática servicios web estándar a partir de las ontologías BioMoby. La representación semántica de los servicios web simplifica su búsqueda y anotación. El Registro Semántico de Servicios Web (BioSWR) está basado en la ontología WSDL del W3C y proporciona una representación en distintos formatos: OWL 2, WSDL 1.1, WSDL 2.0 y WADL. Para demostrar los beneficios de la descripción semántica de los servicios web se ha desarrollado un plugin para Taverna. También se ha implementado una nueva librería experimental que ha sido usada en la aplicación Galaxy Gears, la cual permite la integración de los servicios web en Galaxy. La tesis explora el alcance de la aplicación de las ontologías para la integración de los datos y los servicios biológicos, proporcionando un amplio conjunto de nuevas aplicaciones.
APA, Harvard, Vancouver, ISO, and other styles
37

Wang, Guilian. "Schema mapping for data transformation and integration." Connect to a 24 p. preview or request complete full text in PDF format. Access restricted to UC campuses, 2006. http://wwwlib.umi.com/cr/ucsd/fullcit?p3211371.

Full text
Abstract:
Thesis (Ph. D.)--University of California, San Diego, 2006.
Title from first page of PDF file (viewed June 7, 2006). Available via ProQuest Digital Dissertations. Vita. Includes bibliographical references (p. 135-142).
APA, Harvard, Vancouver, ISO, and other styles
38

Schallehn, Eike. "Efficient similarity-based operations for data integration." [S.l. : s.n.], 2004. http://deposit.ddb.de/cgi-bin/dokserv?idn=971682631.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Wiesner, Christian. "Query evaluation techniques for data integration systems." [S.l.] : [s.n.], 2004. http://deposit.ddb.de/cgi-bin/dokserv?idn=971999139.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Zhang, Baoquan. "Integration and management of ophthalmological research data." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1996. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp04/MQ31665.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Keijzer, Ander de. "Management of uncertain data towards unattended integration /." Enschede : University of Twente [Host], 2008. http://doc.utwente.nl/58869.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Su, Weifeng. "Domain-based data integration for Web databases /." View abstract or full-text, 2007. http://library.ust.hk/cgi/db/thesis.pl?CSED%202007%20SU.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Haak, Liane. "Semantische Integration von Data Warehousing und Wissensmanagement /." Berlin : Dissertation.de - Verl. im Internet, 2008. http://d-nb.info/989917010/04.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Wang, Hongxia. "Data service framework for urban information integration." Thesis, University of Salford, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.490440.

Full text
Abstract:
Comprehensive and accurate information plays a key role in the urban planning process. Recent developments in Information and Communications Technologies (ICT) have provided considerable challenges and opportunities to improve the management of planning processes and make better use of planning information. However, data sharing and integration are always problematic for urban planning tasks because urban datasets are heterogeneous and scattered across different domains and organisations. It has been stated that planners spend about 80 percent of their time coordinating various datasets and analysing information (Singh 2004). The aim of this research is to develop a technical solution for providing information support for urban planning. The research focuses on planning data representation and integration in order to produce semantically rich urban models.
APA, Harvard, Vancouver, ISO, and other styles
45

Lo, Chi-lik Eric, and 盧至力. "Bridging data integration technology and e-commerce." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2003. http://hub.hku.hk/bib/B29360705.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

DIAS, SANDRA APARECIDA. "SEMANTIC DATA INTEGRATION WITH AN ONTOLOGY FEDERATION." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2006. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=9148@1.

Full text
Abstract:
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
O advento da WEB propiciou a disseminação de bases de dados distribuídas e heterogêneas. Por vezes, a resposta a uma consulta demanda o uso de várias destas bases. É necessário, então, algum nível de integração destas. A publicação dessas bases nem sempre segue um padrão semântico. Em função disso parece ser essencial existir um meio de relacionar os diferentes dados para satisfazer tais consultas. Este processo é comumente denominado de integração de dados. A comunidade de Banco de Dados tem conhecimento de métodos para dar conta desta integração no contexto de federações de Bancos de Dados heterogêneos. No entanto, atualmente existem descrições mais ricas e com mais possibilidades de semântica, tais como aquelas induzidas pelo conceito de ontologia. A comunidade de Banco de Dados tem considerado ontologias na solução do problema da integração de Banco de Dados. O alinhamento ou merge de ontologias são algumas das propostas conhecidas da comunidade de WEB semântica. Este trabalho propõe o uso de métodos de merge de ontologias como solução ao problema da construção de uma federação de ontologias como método integrador de fontes de dados. O trabalho inclui a implementação de um estudo de caso na ferramenta Protegé. Este estudo de caso permite discutir aspectos de escalabilidade e de aplicabilidade da proposta como uma solução tecnologicamente viável.
The WEB has spread the use of heterogeneous distributed databases. Sometimes the answer to a query demands the use of more than one database, so some level of integration among these databases is desired. However, the bases are frequently not designed according to a single semantic pattern. Thus, it seems essential to relate the different data in the respective bases in order to provide an adequate answer to the query. The process of building this relationship is often called data integration. The Database community has acquired enough knowledge to deal with this in the context of heterogeneous database federations. Nowadays, there are more expressive model descriptions, namely ontologies. The Database community has also considered ontologies as a tool that can contribute to a solution to the data integration problem. The Semantic Web community has proposed the alignment or merging of ontologies as one of the possible solutions to some of these integration problems. This work aims to use ontology merging methods to construct a federation of ontologies as a means of integrating data sources. The dissertation includes a case study written in the Protégé tool. From this case study, a discussion follows on the scalability and applicability of the proposal as a feasible technological solution for data integration.
APA, Harvard, Vancouver, ISO, and other styles
47

Шендрик, Віра Вікторівна, Вера Викторовна Шендрик, Vira Viktorivna Shendryk, A. Boiko, and Y. Mashyn. "Mechanisms of Data Integration in Information Systems." Thesis, Sumy State University, 2016. http://essuir.sumdu.edu.ua/handle/123456789/47066.

Full text
Abstract:
The effectiveness of enterprise management is supported by the common use of information systems, and it can only increase through flexible data management. Creating mechanisms for data integration is one of the most pressing issues in the field of information systems. The nature and complexity of integration methods greatly depend on the level of integration, the features of the various data sources and of their set as a whole, and the identified ways of integration.
APA, Harvard, Vancouver, ISO, and other styles
48

Salazar, Gustavo A. "Integration and visualisation of data in bioinformatics." Doctoral thesis, University of Cape Town, 2015. http://hdl.handle.net/11427/16861.

Full text
Abstract:
Includes bibliographical references
The most recent advances in laboratory techniques aimed at observing and measuring biological processes are characterised by their ability to generate large amounts of data. The more data we gather, the greater the chance of finding clues to understand the systems of life. This, however, is only true if the methods that analyse the generated data are efficient, effective, and robust enough to overcome the challenges intrinsic to the management of big data. The computational tools designed to overcome these challenges should also take into account the requirements of current research. Science demands specialised knowledge for understanding the particularities of each study; in addition, it is seldom possible to describe a single observation without considering its relationship with other processes, entities or systems. This thesis explores two closely related fields: the integration and visualisation of biological data. We believe that these two branches of study are fundamental in the creation of scientific software tools that respond to the ever-increasing needs of researchers. The distributed annotation system (DAS) is a community project that supports the integration of data from federated sources and its visualisation on web and stand-alone clients. We have extended the DAS protocol to improve its search capabilities and also to support feature annotation by the community. We have also collaborated on the implementation of MyDAS, a server to facilitate the publication of biological data following the DAS protocol, and contributed to the design of the protein DAS client called DASty. Furthermore, we have developed a tool called probeSearcher, which uses the DAS technology to facilitate the identification of microarray chips that include probes for regions on proteins of interest. Another community project in which we participated is BioJS, an open source library of visualisation components for biological data. This thesis includes a description of the project, our contributions to it and some developed components that are part of it. Finally, and most importantly, we combined several BioJS components over a modular architecture to create PINV, a web-based visualiser of protein-protein interaction (PPI) networks that takes advantage of the features of modern web technologies in order to explore PPI datasets on an almost ubiquitous platform (the web) and facilitates collaboration between scientific peers. This thesis includes a description of the design and development processes of PINV, as well as current use cases that have benefited from the tool and whose feedback has been the source of several improvements to PINV. Collectively, this thesis describes novel software tools that, by using modern web technologies, facilitate the integration, exploration and visualisation of biological data, which has the potential to contribute to our understanding of the systems of life.
APA, Harvard, Vancouver, ISO, and other styles
49

Guan, Xiaowei. "Bioinformatics Approaches to Heterogeneous Omic Data Integration." Case Western Reserve University School of Graduate Studies / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=case1340302883.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

ZHAN, YUNSONG. "XML-BASED DATA INTEGRATION FOR APPLICATION INTROPERABILITY." University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1020432798.

Full text
APA, Harvard, Vancouver, ISO, and other styles
