Theses on the topic "Ensemble de données RDF"
See the top 50 theses for your research on the topic "Ensemble de données RDF".
Slama, Olfa. "Flexible querying of RDF databases : a contribution based on fuzzy logic". Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S089/document.
This thesis concerns the definition of a flexible approach for querying both crisp and fuzzy RDF graphs. This approach, based on the theory of fuzzy sets, makes it possible to extend SPARQL, which is the W3C-standardised query language for RDF, so as to be able to express i) fuzzy user preferences on data (e.g., the release year of an album is recent) and on the structure of the data graph (e.g., the path between two friends is required to be short) and ii) more complex user preferences, namely, fuzzy quantified statements (e.g., most of the albums that are recommended by an artist are highly rated and have been created by a young friend of this artist). We performed some experiments in order to study the performance of this approach. The main objective of these experiments was to show that the extra cost due to the introduction of fuzziness remains limited and acceptable. We also investigated, in a more general framework, namely graph databases, the issue of integrating the same type of fuzzy quantified statements in a fuzzy extension of Cypher, which is a declarative language for querying (crisp) graph databases. Some experimental results are reported and show that the extra cost induced by the fuzzy quantified nature of the queries also remains very limited
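As a rough illustration of the fuzzy preference "the release year of an album is recent" mentioned in the abstract above, the following Python sketch grades albums with a membership function instead of a crisp filter. The function names, the thresholds (2010 and 2015, 3.0 and 4.5), the sample albums and the use of the minimum as fuzzy conjunction are assumptions made for illustration; they are not taken from the thesis, which expresses such preferences in an extension of SPARQL rather than in Python.

```python
def ramp(x, lower, upper):
    """Increasing membership function: 0 at or below `lower`, 1 at or above `upper`."""
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (x - lower) / (upper - lower)

def recent(year):
    # Hypothetical thresholds: not recent before 2010, fully recent from 2015.
    return ramp(year, 2010, 2015)

def highly_rated(rating):
    # Hypothetical thresholds on a 0-5 rating scale.
    return ramp(rating, 3.0, 4.5)

def satisfaction(album):
    # Fuzzy conjunction taken as the minimum of both degrees (an assumption).
    return min(recent(album["year"]), highly_rated(album["rating"]))

# Made-up sample data.
albums = [
    {"title": "A", "year": 2016, "rating": 4.8},
    {"title": "B", "year": 2012, "rating": 4.5},
    {"title": "C", "year": 2005, "rating": 5.0},
]

# Rank albums by decreasing satisfaction degree instead of filtering them out.
ranked = sorted(albums, key=satisfaction, reverse=True)
for album in ranked:
    print(album["title"], round(satisfaction(album), 2))
```

Unlike a Boolean filter, every album receives a degree in [0, 1], so answers can be ranked and a threshold can be applied afterwards if needed.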
Abbas, Nacira. "Formal Concept Analysis for Discovering Link Keys in the Web of Data". Electronic Thesis or Diss., Université de Lorraine, 2023. http://www.theses.fr/2023LORR0202.
The Web of data is a global data space that can be seen as an additional layer interconnected with the Web of documents. Data interlinking is the task of discovering identity links across RDF (Resource Description Framework) datasets over the Web of data. We focus on a specific approach for data interlinking, which relies on “link keys”. A link key has the form of two sets of pairs of properties associated with a pair of classes. For example, the link key ({(designation,title)},{(designation,title),(creator,author)},(Book,Novel)) states that whenever an instance “a” of the class “Book” and an instance “b” of the class “Novel” share at least one value for the properties “creator” and “author”, and “a” and “b” have the same values for the properties “designation” and “title”, then “a” and “b” denote the same entity. Then (a,owl:sameAs,b) is an identity link over the two datasets. However, link keys are not always provided, and various algorithms have been developed to automatically discover these keys. First, these algorithms focus on finding “link key candidates”. The quality of these candidates is then evaluated using appropriate measures, and valid link keys are selected accordingly. Formal Concept Analysis (FCA) has been closely associated with the discovery of link key candidates, leading to the proposal of an FCA-based algorithm for this purpose. Nevertheless, existing algorithms for link key discovery have certain limitations. First, they do not explicitly specify the associated pairs of classes for the discovered link key candidates, which can lead to inaccurate evaluations. Additionally, the selection strategies employed by these algorithms may also produce less accurate results. Furthermore, redundancy is observed among the sets of discovered candidates, which presents challenges for their visualization, evaluation, and analysis. To address these limitations, we propose to extend the existing algorithms in several aspects. 
Firstly, we introduce a method based on Pattern Structures, an FCA generalization that can handle non-binary data. This approach allows for explicitly specifying the associated pairs of classes for each link key candidate. Secondly, based on the proposed Pattern Structure, we present two methods for link key selection. The first method is guided by the associated pairs of classes of link keys, while the second method utilizes the lattice generated by the Pattern Structure. These two methods improve the selection compared to the existing strategy. Finally, to address redundancy, we introduce two methods. The first method involves Partition Pattern Structure, which identifies and merges link key candidates that generate the same partitions. The second method is based on hierarchical clustering, which groups candidates producing similar link sets into clusters and selects a representative for each cluster. This approach effectively minimizes redundancy among the link key candidates
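The link key example given in the abstract above, ({(designation,title)},{(designation,title),(creator,author)},(Book,Novel)), can be read operationally. The Python sketch below applies that reading to two small made-up datasets: the first set of property pairs requires equal value sets, the second requires at least one shared value. The dictionary encoding of instances, the function names, the sample data and the choice to reject empty value sets in the equality test are assumptions for illustration, not the thesis's implementation (which is about discovering such keys, not only applying them).

```python
def values(instance, prop):
    """Set of values an instance has for a property (empty if absent)."""
    return set(instance.get(prop, ()))

def link_key_matches(a, b, eq_pairs, in_pairs):
    # Equality pairs: both instances must have exactly the same values.
    # Assumption: an empty value set never satisfies the condition.
    for p, q in eq_pairs:
        va, vb = values(a, p), values(b, q)
        if not va or va != vb:
            return False
    # Intersection pairs: the instances must share at least one value.
    for p, q in in_pairs:
        if not (values(a, p) & values(b, q)):
            return False
    return True

def generate_links(books, novels, eq_pairs, in_pairs):
    """Return owl:sameAs triples for every Book/Novel pair satisfying the link key."""
    return [
        (id_a, "owl:sameAs", id_b)
        for id_a, a in books.items()
        for id_b, b in novels.items()
        if link_key_matches(a, b, eq_pairs, in_pairs)
    ]

# Made-up instances of the classes Book and Novel.
books = {
    "b1": {"designation": ["Dune"], "creator": ["Frank Herbert"]},
    "b2": {"designation": ["Emma"], "creator": ["Jane Austen"]},
}
novels = {
    "n1": {"title": ["Dune"], "author": ["Frank Herbert", "F. Herbert"]},
    "n2": {"title": ["Emma"], "author": ["J. Austen"]},
}

EQ = [("designation", "title")]
IN = [("designation", "title"), ("creator", "author")]

links = generate_links(books, novels, EQ, IN)
print(links)
```

In this toy data only the pair (b1, n1) satisfies both conditions; b2 and n2 agree on the title but share no creator/author value, so no link is generated for them.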
Lesnikova, Tatiana. "Liage de données RDF : évaluation d'approches interlingues". Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM011/document.
The Semantic Web extends the Web by publishing structured and interlinked data using RDF. An RDF data set is a graph where resources are nodes labelled in natural languages. One of the key challenges of linked data is to be able to discover links across RDF data sets. Given two data sets, equivalent resources should be identified and linked by owl:sameAs links. This problem is particularly difficult when resources are described in different natural languages. This thesis investigates the effectiveness of linguistic resources for interlinking RDF data sets. For this purpose, we introduce a general framework in which each RDF resource is represented as a virtual document containing text information of neighboring nodes. The context of a resource consists of the labels of the neighboring nodes. Once virtual documents are created, they are projected in the same space in order to be compared. This can be achieved by using machine translation or multilingual lexical resources. Once documents are in the same space, similarity measures to find identical resources are applied. Similarity between elements of this space is taken for similarity between RDF resources. We performed evaluation of cross-lingual techniques within the proposed framework. We experimentally evaluate different methods for linking RDF data. In particular, two strategies are explored: applying machine translation or using references to multilingual resources. Overall, evaluation shows the effectiveness of cross-lingual string-based approaches for linking RDF resources expressed in different languages. The methods have been evaluated on resources in English, Chinese, French and German. The best performance (over 0.90 F-measure) was obtained by the machine translation approach. This shows that the similarity-based method can be successfully applied on RDF resources independently of their type (named entities or thesauri concepts). 
The best experimental results involving just a pair of languages demonstrated the usefulness of such techniques for interlinking RDF resources cross-lingually
Tanasescu, Adrian. "Vers un accès sémantique aux données : approche basée sur RDF". Lyon 1, 2007. http://www.theses.fr/2007LYO10069.
The thesis mainly focuses on information retrieval through the querying of RDF documents. We propose an approach able to provide complete and pertinent answers to a user-formulated SPARQL query. The approach mainly consists of (1) determining, through a similarity measure, whether two RDF graphs are contradictory, by using the associated ontological knowledge, and (2) building pertinent answers through the combination of statements belonging to non-contradicting RDF graphs that partially answer a given query. We also present an RDF storage and querying platform, named SyRQuS, whose query answering plan is entirely based on the proposed querying approach. SyRQuS is a Web-based platform that mainly provides users with a querying interface where queries can be formulated using SPARQL
Ben, Ellefi Mohamed. "La recommandation des jeux de données basée sur le profilage pour le liage des données RDF". Thesis, Montpellier, 2016. http://www.theses.fr/2016MONTT276/document.
With the emergence of the Web of Data, most notably Linked Open Data (LOD), an abundance of data has become available on the web. However, LOD datasets and their inherent subgraphs vary heavily with respect to their size, topic and domain coverage, the schemas and their data dynamicity (respectively schemas and metadata) over time. To this extent, identifying suitable datasets, which meet specific criteria, has become an increasingly important, yet challenging task to support issues such as entity retrieval or semantic search and data linking. Particularly with respect to the interlinking issue, the current topology of the LOD cloud underlines the need for practical and efficient means to recommend suitable datasets: currently, only well-known reference graphs such as DBpedia (the most obvious target), YAGO or Freebase show a high amount of in-links, while there exists a long tail of potentially suitable yet under-recognized datasets. This problem is due to the semantic web tradition in dealing with "finding candidate datasets to link to", where data publishers are used to identify target datasets for interlinking. While an understanding of the nature of the content of specific datasets is a crucial prerequisite for the mentioned issues, we adopt in this dissertation the notion of "dataset profile" - a set of features that describe a dataset and allow the comparison of different datasets with regard to their represented characteristics. Our first research direction was to implement a collaborative filtering-like dataset recommendation approach, which exploits both existing dataset topic profiles, as well as traditional dataset connectivity measures, in order to link LOD datasets into a global dataset-topic-graph. This approach relies on the LOD graph in order to learn the connectivity behaviour between LOD datasets. 
However, experiments have shown that the current topology of the LOD cloud is far from complete enough to be considered as a ground truth and consequently as learning data. Facing the limits of the current topology of LOD (as learning data), our research has led us to break away from the topic profiles representation of the "learn to rank" approach and to adopt a new approach for candidate dataset identification where the recommendation is based on the intensional profile overlap between different datasets. By intensional profile, we understand the formal representation of a set of schema concept labels that best describe a dataset and can be potentially enriched by retrieving the corresponding textual descriptions. This representation provides richer contextual and semantic information and makes it possible to compute similarities between profiles efficiently and inexpensively. We identify schema overlap with the help of a semantico-frequential concept similarity measure and a ranking criterion based on the tf*idf cosine similarity. The experiments, conducted over all available linked datasets on the LOD cloud, show that our method achieves an average precision of up to 53% for a recall of 100%. Furthermore, our method returns the mappings between the schema concepts across datasets, a particularly useful input for the data linking step. In order to ensure high-quality, representative dataset schema profiles, we introduce Datavore, a tool oriented towards metadata designers that provides ranked lists of vocabulary terms to reuse in the data modeling process, together with additional metadata and cross-term relations. The tool relies on the Linked Open Vocabulary (LOV) ecosystem for acquiring vocabularies and metadata and is made available to the community
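The ranking criterion named in the abstract above, tf*idf cosine similarity between dataset profiles, can be sketched in a few lines of Python. This is only an illustration of that one ingredient: the profile contents, dataset names and function names are made up, each profile is simplified to a plain list of schema concept labels, and the semantico-frequential concept similarity used in the thesis is not modelled here.

```python
import math
from collections import Counter

def tfidf_vectors(profiles):
    """Map each dataset name to a sparse tf*idf vector over its concept labels."""
    n = len(profiles)
    df = Counter()
    for labels in profiles.values():
        df.update(set(labels))  # document frequency: one count per profile
    vectors = {}
    for name, labels in profiles.items():
        tf = Counter(labels)
        vectors[name] = {
            term: (count / len(labels)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(source, profiles):
    """Rank all other datasets by decreasing similarity to `source`."""
    vectors = tfidf_vectors(profiles)
    scores = {
        name: cosine(vectors[source], vec)
        for name, vec in vectors.items()
        if name != source
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical intensional profiles (schema concept labels per dataset).
profiles = {
    "music_a": ["artist", "album", "track"],
    "music_b": ["artist", "album", "label"],
    "geo": ["city", "country", "river"],
}

ranking = rank_candidates("music_a", profiles)
for name, score in ranking:
    print(name, round(score, 3))
```

With these toy profiles, the dataset sharing schema concepts with the source is ranked first, and the unrelated one gets a score of zero; a real recommender would also have to handle synonymous labels across schemas.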
Ouksili, Hanane. "Exploration et interrogation de données RDF intégrant de la connaissance métier". Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLV069.
An increasing number of datasets are published on the Web, expressed in languages proposed by the W3C to describe Web data such as RDF, RDF(S) and OWL. The Web has become an unprecedented source of information available for users and applications, but the meaningful usage of this information source is still a challenge. Querying these data sources requires the knowledge of a formal query language such as SPARQL, but it mainly suffers from the lack of knowledge about the source itself, which is required in order to target the resources and properties relevant for the specific needs of the application. The work described in this thesis addresses the exploration of RDF data sources. This exploration is done in two complementary ways: discovering the themes or topics representing the content of the data source, and providing support for an alternative way of querying the data sources by using keywords instead of a query formulated in SPARQL. The proposed exploration approach combines two complementary strategies: thematic-based exploration and keyword search. Theme discovery from an RDF dataset consists in identifying a set of sub-graphs which are not necessarily disjoint, and such that each one represents a set of semantically related resources representing a theme according to the point of view of the user. These themes can be used to enable a thematic exploration of the data source where users can target the relevant theme and limit their exploration to the resources composing this theme. Keyword search is a simple and intuitive way of querying data sources. In the case of RDF datasets, this search raises several problems, such as indexing graph elements, identifying the relevant graph fragments for a specific query, aggregating these relevant fragments to build the query results, and ranking these results. 
In our work, we address these different problems and we propose an approach which takes as input a keyword query and provides a list of sub-graphs, each one representing a candidate result for the query. These sub-graphs are ordered according to their relevance to the query. For both keyword search and theme identification in RDF data sources, we have taken into account some external knowledge in order to capture the users' needs, or to bridge the gap between the concepts invoked in a query and the ones of the data source. This external knowledge could be domain knowledge that makes it possible to refine the user's need expressed by a query, or to refine the definition of themes. In our work, we have proposed a formalization of this external knowledge and we have introduced the notion of pattern to this end. These patterns represent equivalences between properties and paths in the dataset. They are evaluated and integrated in the exploration process to improve the quality of the result
Michel, Franck. "Intégrer des sources de données hétérogènes dans le Web de données". Thesis, Université Côte d'Azur (ComUE), 2017. http://www.theses.fr/2017AZUR4002/document.
To a great extent, the success of the Web of Data depends on the ability to reach out to legacy data locked in silos inaccessible from the web. In the last 15 years, various works have tackled the problem of exposing various structured data in the Resource Description Framework (RDF). Meanwhile, the overwhelming success of NoSQL databases has made the database landscape more diverse than ever. NoSQL databases are strong potential contributors of valuable linked open data. Hence, the object of this thesis is to enable RDF-based data integration over heterogeneous data sources and, in particular, to harness NoSQL databases to populate the Web of Data. We propose a generic mapping language, xR2RML, to describe the mapping of heterogeneous data sources into an arbitrary RDF representation. xR2RML relies on and extends previous works on the translation of RDBs, CSV/TSV and XML into RDF. With such an xR2RML mapping, we propose either to materialize RDF data or to dynamically evaluate SPARQL queries on the native database. In the latter case, we follow a two-step approach. The first step performs the translation of a SPARQL query into a pivot abstract query based on the xR2RML mapping of the target database to RDF. In the second step, the abstract query is translated into a concrete query, taking into account the specificities of the database query language. Great care is taken to exploit query optimization opportunities, both at the abstract and the concrete levels. To demonstrate the effectiveness of our approach, we have developed a prototype implementation for MongoDB, the popular NoSQL document store. We have validated the method using a real-life use case in Digital Humanities
Bouhamoum, Redouane. "Découverte automatique de schéma pour les données irrégulières et massives". Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG081.
The web of data is a huge global data space, relying on semantic web technologies, where a high number of sources are published and interlinked. This data space provides an unprecedented amount of knowledge available for novel applications, but the meaningful usage of its sources is often difficult due to the lack of a schema describing the content of these data sources. Several automatic schema discovery approaches have been proposed, but while they provide good quality schemas, their use for massive data sources is a challenge as they rely on costly algorithms. In our work, we are interested in both the scalability and the incrementality of schema discovery approaches for RDF data sources where the schema is incomplete or missing. Furthermore, we extend schema discovery to take into account not only the explicit information provided by a data source, but also the implicit information which can be inferred. Our first contribution consists of a scalable schema discovery approach which extracts the classes describing the content of a massive RDF data source. We have proposed to extract a condensed representation of the source, which will be used as an input to the schema discovery process in order to improve its performance. This representation is a set of patterns, each one representing a combination of properties describing some entities in the dataset. We have also proposed a scalable schema discovery approach relying on a distributed clustering algorithm that forms groups of structurally similar entities representing the classes of the schema. Our second contribution aims at maintaining the generated schema consistent with the data source it describes, as the latter may evolve over time. 
We propose an incremental schema discovery approach that modifies the set of extracted classes by propagating the changes occurring at the source, in order to keep the schema consistent with its evolutions. Finally, the goal of our third contribution is to extend schema discovery to consider the whole semantics expressed by a data source, which is represented not only by the explicitly declared triples, but also by the ones which can be inferred through reasoning. We propose an extension that takes into account all the properties of an entity during schema discovery, represented either by explicit or by implicit triples, which will improve the quality of the generated schema
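The condensed representation described in the abstract above, a set of patterns where each pattern is a combination of properties describing some entities, can be illustrated with a small Python sketch. The triple encoding, the function name and the sample data are assumptions for illustration; the thesis works on massive sources with distributed processing, which this single-machine toy does not attempt to reproduce.

```python
from collections import defaultdict

def extract_patterns(triples):
    """Group entities by the exact set of properties that describes them.

    Returns a dict mapping each pattern (a frozenset of property names)
    to the set of subjects described by exactly that combination.
    """
    properties_of = defaultdict(set)
    for subject, prop, _obj in triples:
        properties_of[subject].add(prop)

    patterns = defaultdict(set)
    for subject, props in properties_of.items():
        patterns[frozenset(props)].add(subject)
    return dict(patterns)

# Made-up RDF-like triples (subject, property, object).
triples = [
    ("alice", "name", "Alice"),
    ("alice", "birthDate", "1990-01-01"),
    ("bob", "name", "Bob"),
    ("bob", "birthDate", "1985-05-05"),
    ("paris", "name", "Paris"),
    ("paris", "population", "2100000"),
]

patterns = extract_patterns(triples)
for pattern, entities in patterns.items():
    print(sorted(pattern), sorted(entities))
```

Here six triples about three entities collapse into two patterns, which is the kind of reduction that makes the patterns, rather than the individual entities, a cheaper input for a subsequent clustering step.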
Rihany, Mohamad. "Keyword Search and Summarization Approaches for RDF Dataset Exploration". Electronic Thesis or Diss., université Paris-Saclay, 2022. http://www.theses.fr/2022UPASG030.
An increasing number of datasets are published on the Web, expressed in the standard languages proposed by the W3C such as RDF, RDF(S), and OWL. These datasets represent an unprecedented amount of data available for users and applications. In order to identify and use the relevant datasets, users and applications need to explore them using queries written in SPARQL, a query language proposed by the W3C. But in order to write a SPARQL query, a user should not only be familiar with the query language but also have knowledge about the content of the RDF dataset in terms of the resources, classes or properties it contains. The goal of this thesis is to provide approaches to support the exploration of these RDF datasets. We have studied two alternative and complementary exploration techniques, keyword search and summarization of an RDF dataset. Keyword search returns RDF graphs in response to a query expressed as a set of keywords, where each resulting graph is the aggregation of elements extracted from the source dataset. These graphs represent possible answers to the keyword query, and they can be ranked according to their relevance. Keyword search in RDF datasets raises the following issues: (i) identifying for each keyword in the query the matching elements in the considered dataset, taking into account the differences of terminology between the keywords and the terms used in the RDF dataset, (ii) combining the matching elements to build the result by defining aggregation algorithms that find the best way of linking matching elements, and finally (iii) finding appropriate metrics to rank the results, as several matching elements may exist for each keyword and consequently several graphs may be returned. In our work, we propose a keyword search approach that addresses these issues. Providing a summarized view of an RDF dataset can help users identify whether this dataset is relevant to their needs, and highlight its most relevant elements. 
This could be useful for the exploration of a given dataset. In our work, we propose a novel summarization approach based on the underlying themes of a dataset. Our theme-based summarization approach consists of extracting the existing themes in a data source, and building the summarized view so as to ensure that all these discovered themes are represented. This raises the following questions: (i) how to identify the underlying themes in an RDF dataset? (ii) what are the suitable criteria to identify the relevant elements in the themes extracted from the RDF graph? (iii) how to aggregate and connect the relevant elements to create a theme summary? and finally, (iv) how to create the summary for the whole RDF graph from the generated theme summaries? In our work, we propose a theme-based summarization approach for RDF datasets which answers these questions and provides a summarized representation ensuring that each theme is represented proportionally to its importance in the initial dataset
Lozano, Aparicio Jose Martin. "Data exchange from relational databases to RDF with target shape schemas". Thesis, Lille 1, 2020. http://www.theses.fr/2020LIL1I063.
Resource Description Framework (RDF) is a graph data model which has recently been used to publish data from relational databases on the web. We investigate data exchange from relational databases to RDF graphs with target shapes schemas. Essentially, data exchange models a process of transforming an instance of a relational schema, called the source schema, to an RDF graph constrained by a target schema, according to a set of rules, called source-to-target tuple generating dependencies. The output RDF graph is called a solution. Because the tuple generating dependencies define this process in a declarative fashion, there might be many possible solutions or no solution at all. We study the constructive relational to RDF data exchange setting with target shapes schemas, which is composed of a relational source schema, a shapes schema for the target schema, and a set of mappings that use IRI constructors. Furthermore, we assume that any two IRI constructors are non-overlapping. We propose a visual mapping language (VML) that helps non-expert users to specify mappings in this setting. Moreover, we develop a tool called ShERML that performs data exchange with the use of VML and, for users that want to understand the model behind VML mappings, we define R2VML, a text-based mapping language that captures VML and presents a succinct syntax for defining mappings. We investigate the problem of checking consistency: a data exchange setting is consistent if for every input source instance, there is at least one solution. We show that the consistency problem is coNP-complete and provide a static analysis algorithm that decides whether the setting is consistent or not. We study the problem of computing certain answers. An answer is certain if the answer holds in every solution. Typically, certain answers are computed using a universal solution. However, in our setting a universal solution might not exist. 
Thus, we introduce the notion of universal simulation solution, which always exists and makes it possible to compute certain answers to any class of queries that is robust under simulation. One such class is nested regular expressions (NREs) that are forward, i.e., do not use the inverse operation. Using a universal simulation solution renders tractable the computation of certain answers to forward NREs (data complexity). Finally, we investigate the shapes schema elicitation problem that consists of constructing a target shapes schema from a constructive relational to RDF data exchange setting without the target shapes schema. We identify two desirable properties of a good target schema, which are soundness, i.e., every produced RDF graph is accepted by the target schema; and completeness, i.e., every RDF graph accepted by the target schema can be produced. We propose an elicitation algorithm that is sound for any schema-less data exchange setting, and that is also complete for a large practical class of schema-less settings
Kellou-Menouer, Kenza. "Découverte de schéma pour les données du Web sémantique". Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLV047/document.
An increasing number of linked data sources are published on the Web. However, their schema may be incomplete or missing. In addition, data do not necessarily follow their schema. This flexibility for describing the data eases their evolution, but makes their exploitation more complex. In our work, we have proposed an automatic and incremental approach enabling schema discovery from the implicit structure of the data. To complement the description of the types in a schema, we have also proposed an approach for finding the possible versions (patterns) for each of them. It proceeds online without having to download or browse the source. This can be expensive or even impossible because the sources may have some access limitations, either on the query execution time, or on the number of queries. We have also addressed the problem of annotating the types in a schema, which consists in finding a set of labels capturing their meaning. We have proposed annotation algorithms which provide meaningful labels using external knowledge bases. Our approach can be used to find meaningful type labels during schema discovery, and also to enrich the description of existing types. Finally, we have proposed an approach to evaluate the gap between a data source and its schema. To this end, we have proposed a set of quality factors and the associated metrics, as well as a schema extension that reflects the heterogeneity among instances of the same type. Both factors and schema extension are used to analyze and improve the conformity between a schema and the instances it describes
Taki, Sara. "Anonymisation de données liées en utilisant la confidentialité différentielle". Electronic Thesis or Diss., Bourges, INSA Centre Val de Loire, 2023. http://www.theses.fr/2023ISAB0009.
This thesis studies the problem of privacy in linked open data (LOD). This work is at the intersection of long lines of work on data privacy and linked open data. Our goal is to study how the presence of semantics impacts the publication of data and possible data leaks. We consider RDF as the format to represent LOD and Differential Privacy (DP) as the main privacy concept. DP was initially conceived to define privacy in the relational database (RDB) domain and is based on a quantification of the difficulty for an attacker observing an output to identify which database among a neighborhood is used to produce it. The objective of this thesis is four-fold: O1) to improve the privacy of LOD, in particular, to propose an approach to construct usable DP mechanisms on RDF; O2) to study how neighborhood definitions over RDB in the presence of foreign key (FK) constraints translate to RDF; O3) to propose new neighborhood definitions over relational databases translating into existing graph concepts to ease the design of DP mechanisms; and O4) to support the implementation of sanitization mechanisms for RDF graphs with a rigorous formal foundation. For O1, we propose a novel approach based on graph projection to adapt DP to RDF. For O2, we determine the privacy model resulting from the translation of popular privacy models over RDB with FK constraints to RDF. For O3, we propose the restrict deletion neighborhood over RDB with FK constraints, whose translation to the RDF graph world is equivalent to typed-node neighborhood. Moreover, we propose a looser definition translating to typed-outedge neighborhood. For O4, we propose a graph transformation language based on graph rewriting to serve as a basis for constructing various sanitization mechanisms on attributed graphs. We support all our theoretical contributions with proof-of-concept prototypes that implement our proposals and are evaluated on real datasets to show the applicability of our work
Yang, Jitao. "Un modèle de données pour bibliothèques numériques". Thesis, Paris 11, 2012. http://www.theses.fr/2012PA112085.
Digital Libraries are complex information systems, storing digital resources (e.g., text, images, sound, audio), as well as knowledge about digital or non-digital resources; this knowledge is referred to as metadata. We propose a data model for digital libraries supporting resource identification, use of metadata and re-use of stored resources, as well as a query language supporting discovery of resources. The model that we propose is inspired by the architecture of the Web, which forms a solid, universally accepted basis for the notions and services expected from a digital library. We formalize our model as a first-order theory, in order to be able to express the basic concepts of digital libraries without being constrained by any technical considerations. The axioms of the theory give the formal semantics of the notions of the model, and at the same time, provide a definition of the knowledge that is implicit in a digital library. The theory is then translated into a Datalog program that, given a digital library, makes it possible to efficiently complete the digital library with the knowledge implicit in it. The goal of our research is to contribute to the information management technology of digital libraries. In this way, we are able to demonstrate the theoretical feasibility of our digital library model, by showing that it can be efficiently implemented. Moreover, we demonstrate our model’s practical feasibility by providing a full translation of the model into RDF and of the query language into SPARQL. We provide a sound and complete calculus for reasoning on the RDF graphs resulting from translation. Based on this calculus, we prove the correctness of both translations, showing that the translation functions preserve the semantics of the digital library and of the query language
Picalausa, Francois. "Guarded structural indexes: theory and application to relational RDF databases". Doctoral thesis, Universite Libre de Bruxelles, 2013. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209432.
This growth in the volume of semi-structured data has prompted increasing interest in the development of suitable databases. Among the approaches proposed, one can distinguish relational approaches and graph approaches, as detailed in Chapter 3. The former aim to exploit existing relational database engines by integrating specialized techniques into them. The latter view semi-structured data as graphs, that is, a set of nodes linked to one another by labeled edges, whose structure they exploit. One of the techniques in this area, known as structural indexing, aims to summarize data graphs so that the data useful for processing a query can be identified quickly.
Classical structural indexes are built on the notions of simulation and bisimulation on graphs. These notions, which are used in many fields such as verification, security, and data storage, are relations on the nodes of graphs. Fundamentally, they characterize the fact that two nodes share certain characteristics, such as the same neighborhood.
Although graph approaches are efficient in practice, they have limitations in the context of RDF and its query language SPARQL. In these approaches, labels are distinct from the nodes of the graph. In the model described by RDF and supported by SPARQL, labels and nodes nevertheless belong to the same set. This is why graph approaches support only a subset of SPARQL queries. By contrast, relational approaches are faithful to the RDF model and can answer the various SPARQL queries.
The question we wish to answer in this thesis is whether relational and graph approaches are incompatible, or whether they can be combined advantageously. In particular, it would be desirable to retain the performance of graph approaches and the generality of relational approaches. To this end, we build a structural index suited to relational data.
We rely on a methodology described by Fletcher and his coauthors for designing structural indexes. This methodology rests on three main components. The first component is a so-called structural characterization of the query language to be supported. The aim here is to identify, as precisely as possible, the data that are returned together by any query of the language. The second component is an algorithm that must efficiently group the data that are returned together, according to the structural characterization. The third component is the index itself: a data structure that must make it possible to identify the groups of data generated by the preceding algorithm in order to answer queries.
First, it should be noted that the SPARQL language taken as a whole does not lend itself to building efficient structural indexes. Indeed, the foundation of SPARQL queries lies in the expression of conjunctive queries. The structural characterization of conjunctive queries is known, but does not lend itself to the construction of efficient grouping algorithms. Nevertheless, the empirical study of SPARQL queries posed in practice, which we carry out in Chapter 5, shows that these are mainly acyclic conjunctive queries. Acyclic conjunctive queries are known in the literature to admit efficient evaluation algorithms.
The first component of our structural index, introduced in Chapter 6, is a characterization of acyclic conjunctive queries. This characterization is given in terms of guarded simulation. For graphs, the notion of simulation is a restricted version of the notion of bisimulation. Similarly, we introduce the notion of guarded simulation as a restriction of the notion of guarded bisimulation, a known extension of the notion of bisimulation to relational data.
Chapter 7 provides a second component of our structural index. This component is a data structure called the guarded structural index, which supports the processing of arbitrary conjunctive queries. We show that, coupled with the preceding structural characterization, this index makes it possible to optimally identify the data useful for processing acyclic conjunctive queries.
Chapter 8 constitutes the third component of our structural index and proposes efficient methods for computing guarded simulation. Our algorithm essentially consists of transforming a database into a particular graph on which the notions of simulation and guarded simulation coincide. It then becomes possible to reuse existing algorithms for computing simulation relations.
While the preceding chapters lay the necessary foundation for a structural index targeting relational data, they do not yet integrate this index into a relational database engine. This is what Chapter 9 proposes, by developing methods that take the index into account during the processing of a SPARQL query. Convincing experimental results complete this study.
This work thus provides a first positive answer to the question of whether the relational and graph approaches to RDF data storage can be combined advantageously.
Doctorate in Engineering Sciences
info:eu-repo/semantics/nonPublished
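The classical notion of simulation on edge-labeled graphs, on which the abstract above builds, can be illustrated with a naive fixpoint computation. This is only a sketch of plain graph simulation, not of the guarded simulation introduced in the thesis; the function name `simulation` and the toy graph are hypothetical.

```python
def simulation(nodes, edges):
    """Return the largest simulation relation on an edge-labeled graph.

    edges is a set of (source, label, target) triples. A pair (u, v) is in
    the result when every labeled edge leaving u can be matched by an edge
    with the same label leaving v, leading to a pair that is again related.
    """
    successors = {n: set() for n in nodes}
    for source, label, target in edges:
        successors[source].add((label, target))

    # Start from all pairs and remove the ones that violate the condition.
    relation = {(u, v) for u in nodes for v in nodes}
    changed = True
    while changed:
        changed = False
        for u, v in list(relation):
            matched = all(
                any(l2 == l1 and (t1, t2) in relation for l2, t2 in successors[v])
                for l1, t1 in successors[u]
            )
            if not matched:
                relation.discard((u, v))
                changed = True
    return relation


# Toy graph: b can do everything a can, but not the other way round.
toy_nodes = {"a", "b", "x"}
toy_edges = {("a", "knows", "x"), ("b", "knows", "x"), ("b", "likes", "x")}
toy_relation = simulation(toy_nodes, toy_edges)
```

Here `("a", "b")` is in the relation (b simulates a), while `("b", "a")` is not, because a has no `likes` edge. The quadratic starting relation is chosen for readability rather than efficiency.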
Galicia, Auyón Jorge Armando. "Revisiting Data Partitioning for Scalable RDF Graph Processing Combining Graph Exploration and Fragmentation for RDF Processing Query Optimization for Large Scale Clustered RDF Data RDFPart-Suite: Bridging Physical and Logical RDF Partitioning. Reverse Partitioning for SPARQL Queries: Principles and Performance Analysis. Should We Be Afraid of Querying Billions of Triples in a Graph-Based Centralized System? EXGRAF: Exploration et Fragmentation de Graphes au Service du Traitement Scalable de Requêtes RDF". Thesis, Chasseneuil-du-Poitou, Ecole nationale supérieure de mécanique et d'aérotechnique, 2021. http://www.theses.fr/2021ESMA0001.
The Resource Description Framework (RDF) and SPARQL are very popular graph-based standards initially designed to represent and query information on the Web. The flexibility offered by RDF motivated its use in other domains, and today RDF datasets are great information sources. They gather billions of triples in Knowledge Graphs that must be stored and efficiently exploited. The first generation of RDF systems was built on top of traditional relational databases. Unfortunately, the performance of these systems degrades rapidly, as the relational model is not suitable for handling RDF data inherently represented as a graph. Native and distributed RDF systems seek to overcome this limitation. The former mainly use indexing as an optimization strategy to speed up queries. Distributed and parallel RDF systems resort to data partitioning. The logical representation of the database is crucial to design data partitions in the relational model. The logical layer defining the explicit schema of the database provides a degree of comfort to database designers. It lets them choose manually or automatically (through advisors) the tables and attributes to be partitioned. Besides, it allows the core partitioning concepts to remain constant regardless of the database management system. This design scheme is no longer valid for RDF databases, essentially because the RDF model does not explicitly enforce a schema, since RDF data is mostly implicitly structured. Thus, the logical layer is nonexistent and data partitioning depends strongly on the physical implementation of the triples on disk. This situation leads to different partitioning logics depending on the target system, which is quite different from the relational model's perspective. In this thesis, we promote the novel idea of performing data partitioning at the logical level in RDF databases. To that end, we first process the RDF data graph to support logical entity-based partitioning. After this preparation, we present a partitioning framework built upon these logical structures. This framework is accompanied by data fragmentation, allocation, and distribution procedures. It was incorporated into a centralized (RDF_QDAG) and a distributed (gStoreD) triple store. We conducted several experiments that confirmed the feasibility of integrating our framework into existing systems, improving their performance for certain queries. Finally, we design a set of RDF data partitioning management tools, including a data definition language (DDL) and an automatic partitioning wizard.
Bato, Mary Grace. "Vers une assimilation des données de déformation en volcanologie". Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAU018/document.
Tracking magma emplacement at shallow depth as well as its migration towards the Earth's surface is crucial to forecast volcanic eruptions. With the recent advances in Interferometric Synthetic Aperture Radar (InSAR) imaging and the increasing number of continuous Global Navigation Satellite System (GNSS) networks recorded on volcanoes, it is now possible to provide continuous and spatially extensive evolution of surface displacements during inter-eruptive periods. For basaltic volcanoes, these measurements combined with simple dynamical models can be exploited to characterise and to constrain magma pressure building within one or several magma reservoirs, allowing better predictive information on the emplacement of magma at shallow depths. Data assimilation (a sequential time-forward process that best combines models and observations, and sometimes a priori information based on error statistics, to predict the state of a dynamical system) has recently gained popularity in various fields of geoscience (e.g. ocean-weather forecasting, geomagnetism and natural resources exploration). In this dissertation, I present the very first application of data assimilation in volcanology, from synthetic tests to the analysis of real geodetic data. The first part of this work focuses on the development of strategies to test the applicability and to assess the potential of data assimilation, in particular the Ensemble Kalman Filter (EnKF), using a simple two-chamber dynamical model (Reverso2014) and artificial geodetic data. Synthetic tests are performed in order to: 1) track the magma pressure evolution at depth and reconstruct the synthetic ground surface displacements, as well as estimate non-evolving uncertain model parameters, 2) properly assimilate GNSS and InSAR data, 3) highlight the strengths and weaknesses of EnKF in comparison with a Bayesian-based inversion technique (e.g. Markov Chain Monte Carlo). Results show that EnKF works well with the synthetic cases and that there is great potential in utilising data assimilation for real-time monitoring of volcanic unrest. The second part is focused on applying the strategy that we developed through synthetic tests in order to forecast the rupture of a magma chamber in real time. We explored the 2004-2011 inter-eruptive dataset at Grímsvötn volcano in Iceland. Here, we introduced the concept of "eruption zones" based on the evaluation of the probability of eruption at each time step, estimated as the percentage of model ensembles that exceeded their failure overpressure values, initially assigned following a given distribution. Our results show that when 25 +/- 1% of the model ensembles exceeded the failure overpressure, an actual eruption is imminent. Furthermore, in this chapter, we also extend the previous synthetic tests by further enhancing the EnKF strategy of assimilating geodetic data in order to adapt to real-world problems such as the limited amount of geodetic data available to monitor ice-covered active volcanoes. Common diagnostic tools in data assimilation are presented. Finally, I demonstrate that in addition to the interest of predicting volcanic eruptions, sequential assimilation of geodetic data on the basis of EnKF shows a unique potential to give insights into volcanic system roots. Using the two-reservoir dynamical model for Grímsvötn's plumbing system and assuming a fixed geometry and constant magma properties, we retrieve the temporal evolution of the basal magma inflow beneath Grímsvötn, which drops by up to 85% during the 10 months preceding the initiation of the Bárdarbunga rifting event. The loss of at least 0.016 km³ in the magma supply of Grímsvötn is interpreted as a consequence of magma accumulation beneath Bárdarbunga and subsequent feeding of the Holuhraun eruption 41 km away.
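The eruption probability described in this abstract, the share of ensemble members whose overpressure exceeds its own failure value, can be sketched in a few lines. This is a minimal illustration only: the function name `eruption_probability` and the numbers are hypothetical and are not taken from the thesis or from its EnKF implementation.

```python
def eruption_probability(overpressures, failure_overpressures):
    """Fraction of ensemble members whose overpressure exceeds its failure value.

    Both sequences hold one value per ensemble member, in the same unit.
    """
    if len(overpressures) != len(failure_overpressures):
        raise ValueError("one failure value is required per ensemble member")
    if not overpressures:
        raise ValueError("the ensemble must not be empty")
    exceeded = sum(
        pressure > threshold
        for pressure, threshold in zip(overpressures, failure_overpressures)
    )
    return exceeded / len(overpressures)


# Hypothetical four-member ensemble at one time step (values in MPa).
ensemble_overpressure = [12.0, 18.5, 9.3, 21.0]
failure_overpressure = [15.0, 17.0, 14.0, 25.0]
probability = eruption_probability(ensemble_overpressure, failure_overpressure)
```

In this toy ensemble only the second member exceeds its threshold, so the estimated probability is 0.25. A real ensemble would contain many more members, with failure values drawn from a chosen distribution as the abstract states.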
Alam, Mehwish. "Découverte interactive de connaissances dans le web des données". Thesis, Université de Lorraine, 2015. http://www.theses.fr/2015LORR0158/document.
Recently, the "Web of Documents" has become the "Web of Data", i.e., documents are annotated in the form of RDF, making this human-processable data directly processable by machines. This data can further be explored by the user using SPARQL queries. As web clustering engines provide classification of the results obtained by querying the Web of Documents, a framework for providing classification over SPARQL query answers is also needed to make sense of what is contained in the data. Exploratory Data Mining focuses on providing an insight into the data. It also allows filtering of non-interesting parts of data by directly involving the domain expert in the process. This thesis contributes to aiding the user in exploring Linked Data with the help of exploratory data mining. We study three research directions, i.e., 1) creating views over RDF graphs and allowing user interaction over these views, 2) assessing the quality of and completing RDF data, and finally 3) simultaneous navigation/exploration over heterogeneous and multiple resources present on Linked Data. Firstly, we introduce a solution modifier, i.e., View By, to create views over RDF graphs by classifying SPARQL query answers with the help of Formal Concept Analysis (FCA). In order to navigate the obtained concept lattice and extract knowledge units, we develop a new tool called RV-Explorer (Rdf View eXplorer), which implements several navigational modes. However, this navigation/exploration reveals several incompletions in the data sets. In order to complete the data, we use association rule mining. Furthermore, to provide navigation and exploration directly over RDF graphs along with background knowledge, RDF triples are clustered w.r.t. background knowledge, and these clusters can then be navigated and interactively explored. Finally, it can be concluded that, instead of providing direct exploration, we use FCA as an aid for clustering RDF data, allow the user to explore these clusters of data, and enable the user to reduce the exploration space through interaction.
Huang, Xin. "Querying big RDF data : semantic heterogeneity and rule-based inconsistency". Electronic Thesis or Diss., Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB124.
The Semantic Web is the vision of the next generation of the Web proposed by Tim Berners-Lee in 2001. Indeed, with the rapid development of Semantic Web technologies, large-scale RDF data already exist as linked open data, and their number is growing rapidly. Traditional Semantic Web querying and reasoning tools are designed to run in a stand-alone environment. Therefore, processing large-scale bulk data with traditional solutions inevitably results in bottlenecks of memory space and computational performance. Large volumes of heterogeneous data are collected from different data sources by different organizations. In this context, different sources always contain inconsistencies and uncertainties, which are difficult to identify and evaluate. To address these challenges of the Semantic Web, the following research contents and innovative approaches are proposed. We first developed an inference-based semantic entity resolution approach and linking mechanism for the case where the same entity is provided in multiple RDF resources described using different semantics and URI identifiers. We also developed a MapReduce-based rewriting engine for SPARQL queries over big RDF data to handle the implicit data described intentionally by inference rules during query evaluation. The rewriting approach also deals with transitive closure and cyclic rules to provide an inference language as rich as RDFS and OWL. The second contribution concerns distributed inconsistency processing. We extend the approach presented in the first contribution by taking into account inconsistency in the data. This includes: (1) rule-based inconsistency detection with the help of our query rewriting engine; (2) consistent query evaluation in three different semantics. The third contribution concerns reasoning and querying over large-scale uncertain RDF data. We propose a MapReduce-based approach to deal with large-scale reasoning with uncertainty. Unlike possible worlds semantics, we propose an algorithm for generating an intensional SPARQL query plan over a probabilistic RDF graph for computing the probabilities of results within the query.
Roatis, Alexandra. "Efficient Querying and Analytics of Semantic Web Data". Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112218/document.
The utility and relevance of data lie in the information that can be extracted from it. The high rate of data publication and its increased complexity, for instance the heterogeneous, self-describing Semantic Web data, motivate the interest in efficient techniques for data manipulation. In this thesis we leverage mature relational data management technology for querying Semantic Web data. The first part focuses on query answering over data subject to RDFS constraints, stored in relational data management systems. The implicit information resulting from RDF reasoning is required to correctly answer such queries. We introduce the database fragment of RDF, going beyond the expressive power of previously studied fragments. We devise novel techniques for answering Basic Graph Pattern queries within this fragment, exploring the two established approaches for handling RDF semantics, namely graph saturation and query reformulation. In particular, we consider graph updates within each approach and propose a method for incrementally maintaining the saturation. We experimentally study the performance trade-offs of our techniques, which can be deployed on top of any relational data management engine. The second part of this thesis considers the new requirements for data analytics tools and methods emerging from the development of the Semantic Web. We fully redesign, from the bottom up, core data analytics concepts and tools in the context of RDF data. We propose the first complete formal framework for warehouse-style RDF analytics. Notably, we define analytical schemas tailored to heterogeneous, semantic-rich RDF graphs, and analytical queries which (beyond relational cubes) allow flexible querying of the data and the schema as well as powerful aggregation and OLAP-style operations. Experiments on a fully-implemented platform demonstrate the practical interest of our approach.
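Graph saturation, one of the two approaches named in this abstract, can be illustrated with a small fixpoint loop. The sketch below is an assumption-laden toy: it covers only two RDFS entailment patterns (type propagation along `rdfs:subClassOf`, and transitivity of `rdfs:subClassOf`), uses plain string tuples instead of a real RDF library, and is not the thesis's technique or its incremental maintenance method. The name `saturate` and the example triples are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def saturate(triples):
    """Add implicit triples until no rule produces anything new.

    Rules applied (a small subset of RDFS entailment):
      (s rdf:type C), (C rdfs:subClassOf D)        -> (s rdf:type D)
      (C rdfs:subClassOf D), (D rdfs:subClassOf E) -> (C rdfs:subClassOf E)
    """
    graph = set(triples)
    while True:
        inferred = set()
        for subject, predicate, obj in graph:
            if predicate not in (RDF_TYPE, SUBCLASS_OF):
                continue
            # Both rules follow a subClassOf edge leaving the object position.
            for sub_class, link, super_class in graph:
                if link == SUBCLASS_OF and sub_class == obj:
                    inferred.add((subject, predicate, super_class))
        if inferred <= graph:
            return graph
        graph |= inferred


# Hypothetical explicit triples.
explicit = {
    (":alice", RDF_TYPE, ":Student"),
    (":Student", SUBCLASS_OF, ":Person"),
    (":Person", SUBCLASS_OF, ":Agent"),
}
saturated = saturate(explicit)
```

After saturation, a query asking for all instances of `:Agent` can be answered by a plain lookup, since `(":alice", "rdf:type", ":Agent")` is now stored explicitly. Query reformulation takes the opposite route and leaves the graph unchanged while rewriting the query.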
Dia, Amadou Fall. "Filtrage sémantique et gestion distribuée de flux de données massives". Electronic Thesis or Diss., Sorbonne université, 2018. http://www.theses.fr/2018SORUS495.
Our daily use of the Internet and related technologies generates, at rapid and variable speeds, large volumes of heterogeneous data issued from sensor networks, search engine logs, multimedia content sites, weather forecasting, geolocation, Internet of Things (IoT) applications, etc. Processing such data in conventional databases (Relational Database Management Systems) may be very expensive in terms of time and memory storage resources. To effectively respond to the needs of rapid decision-making, these streams require real-time processing. Data Stream Management Systems (DSMSs) evaluate queries on the recent data of a stream within structures called windows. The input data come in different formats such as CSV, XML, RSS, or JSON. This heterogeneity barrier stems from the nature of data streams and must be resolved. To this end, several research groups have taken advantage of semantic web technologies (RDF and SPARQL) by proposing RDF data stream processing systems called RSPs. However, large volumes of RDF data, high input stream rates, concurrent queries, the combination of RDF streams with large volumes of stored RDF data, and expensive processing drastically reduce the performance of these systems. A new approach is required to considerably reduce the processing load of RDF data streams. In this thesis, we propose several complementary solutions to reduce the processing load in a centralized environment. An on-the-fly sampling approach for RDF graph streams is proposed to reduce data and processing load while preserving semantic links. This approach is deepened by adopting a graph-oriented summary approach to extract the most relevant information from RDF graphs by using centrality measures issued from Social Network Analysis. We also adopt a compressed format of RDF data and propose an approach for querying compressed RDF data without a decompression phase. To ensure parallel and distributed data stream management, the presented work also proposes two solutions for reducing the processing load in a distributed environment: an engine and approaches for the parallel and distributed processing of RDF graph streams. Finally, an optimized processing approach for operations combining static and dynamic data is also integrated into a new distributed RDF graph stream management system.
Pinson, Franck. "Ajustement de primitives d'objets de forme libre sur un ensemble de données réelles". Compiègne, 1989. http://www.theses.fr/1989COMPD179.
Lê, Thanh Vu. "Visualisation interactive 3D pour un ensemble de données géographiques de très grande taille". Pau, 2011. http://www.theses.fr/2011PAUU3005.
Real-time terrain rendering remains an active area of research for many modern computer-based applications such as geographic information systems (GIS), interactive 3D games, flight simulators or virtual reality. The technological breakthroughs in data acquisition, coupled with recent advances in display technology, have simultaneously led to substantial increases in the resolution of both the Digital Elevation Models (DEM) and the various displays used to present this information. In this PhD, we present a new out-of-core terrain visualization algorithm that achieves per-pixel accurate shading of large textured elevation maps in real time. Our first contribution is the LOD scheme, which is based on a small precomputed quadtree of geometric errors, whose nodes are selected for asynchronous loading and rendering depending on a projection of those errors in screen space. The terrain data and its color texture are manipulated by the CPU in a unified manner as a collection of raster image patches, whose dimensions depend on their screen-space occupancy. Our second contribution is a novel method to remove artifacts that appear on the border between quadtree blocks: we generate a continuous surface without needing an additional mesh. Our last contribution is an effective geomorphing method adapted to our data structure, which can be implemented entirely on the GPU. The presented framework exhibits several interesting features compared with other existing techniques: there is no mesh manipulation and no mesh data structure required; terrain geometric complexity depends only on projected elevation error (views from above result in very coarse meshes); lower geometric complexity degrades terrain silhouettes but not details brought in through normal map shading; real-time rendering comes with support for progressive data loading; and geometric information and color textures are similarly and efficiently handled as raster data by the CPU. Due to simplified data structures, the system is compact, CPU and GPU efficient, and simple to implement.
Drouet, d'Aubigny Gérard. "L'analyse multidimensionnelle des données de dissimilarité : [thèse soutenue sur un ensemble de travaux]". Grenoble 1, 1989. http://tel.archives-ouvertes.fr/tel-00332393.
Rabah, Mazouzi. "Approches collaboratives pour la classification des données complexes". Thesis, Paris 8, 2016. http://www.theses.fr/2016PA080079.
This thesis focuses on collaborative classification in the context of complex data, in particular Big Data. We used several computational paradigms to propose new approaches based on HPC technologies. In this context, we aim to offer massive classifiers, in the sense that the number of elementary classifiers that make up the multiple classifier system can be very high. In this case, conventional methods of interaction between classifiers are no longer valid, and we had to propose new forms of interaction that are not constrained to take all classifier predictions into account to build an overall prediction. Accordingly, we faced two problems: the first is the ability of our approaches to scale up; the second is the diversity that must be created and maintained within the system to ensure its performance. Therefore, we studied the distribution of classifiers in a cloud-computing environment; this multiple classifier system can be massive, and its properties are those of a complex system. In terms of data diversity, we proposed a training data enrichment approach that generates synthetic data from analytical models describing part of the phenomenon studied. The resulting mixture of data reinforces the learning of the classifiers. The experiments conducted have shown great potential for substantially improving classification results.
Abidi, Amna. "Imperfect RDF Databases : From Modelling to Querying". Thesis, Chasseneuil-du-Poitou, Ecole nationale supérieure de mécanique et d'aérotechnique, 2019. http://www.theses.fr/2019ESMA0008/document.
The ever-increasing interest in RDF data on the Web has led to several important research efforts to enrich the traditional RDF data formalism for exploitation and analysis purposes. The work of this thesis continues those efforts by addressing the issue of RDF data management in the presence of imperfection (untruthfulness, uncertainty, etc.). The main contributions of this dissertation are as follows. (1) We tackled the trusted RDF data model. Hence, we proposed to extend skyline queries over trust RDF data, which consists in extracting the most interesting trusted resources according to user-defined criteria. (2) We studied via statistical methods the impact of the trust measure on the Trust-skyline set. (3) We integrated into the structure of RDF data (i.e., the subject-property-object triple) a fourth element expressing a possibility measure to reflect the user's opinion about the truth of a statement. To deal with possibility requirements, an appropriate language framework is introduced, namely Pi-SPARQL, which extends SPARQL into a possibility-aware query language. Finally, we studied a new skyline operator variant to extract possibilistic RDF resources that are possibly dominated by no other resources in the sense of Pareto optimality.
Ren, Xiangnan. "Traitement et raisonnement distribués des flux RDF". Thesis, Paris Est, 2018. http://www.theses.fr/2018PESC1139/document.
Real-time processing of data streams emanating from sensors is becoming a common task in industrial scenarios. In an Internet of Things (IoT) context, data are emitted from heterogeneous stream sources, i.e., coming from different domains and data models. This requires that IoT applications efficiently handle data integration mechanisms. The processing of RDF data streams hence became an important research field. This trend enables a wide range of innovative applications where the real-time and reasoning aspects are pervasive. The key implementation goal of such applications consists in efficiently handling massive incoming data streams and supporting advanced data analytics services like anomaly detection. However, a modern RSP engine has to address the volume and velocity characteristics encountered in the Big Data era. In an ongoing industrial project, we found that a stream processing engine available 24/7 usually faces massive data volume and dynamically changing data structure and workload characteristics. These facts impact the engine's performance and reliability. To address these issues, we propose Strider, a hybrid adaptive distributed RDF Stream Processing engine that optimizes the logical query plan according to the state of data streams. Strider has been designed to guarantee important industrial properties such as scalability, high availability, fault tolerance, high throughput and acceptable latency. These guarantees are obtained by designing the engine's architecture with state-of-the-art Apache components such as Spark and Kafka. Moreover, an increasing number of processing jobs executed over RSP engines require reasoning mechanisms. This usually comes at the cost of finding a trade-off between data throughput, latency and the computational cost of expressive inferences. Therefore, we extend Strider to support real-time RDFS+ (i.e., RDFS + owl:sameAs) reasoning capability. We combine Strider with a query rewriting approach for SPARQL that benefits from an intelligent encoding of the knowledge base. The system is evaluated along different dimensions and over multiple datasets to emphasize its performance. Finally, we have stepped further into exploratory RDF stream reasoning with a fragment of Answer Set Programming. This part of our research work is mainly motivated by the fact that more and more streaming applications require more expressive and complex reasoning tasks. The main challenge is to cope with the large volume and high-velocity dimensions in a scalable and inference-enabled manner. Recent efforts in this area are still missing the aspect of system scalability for stream reasoning. Thus, we aim to explore the ability of modern distributed computing frameworks to process highly expressive knowledge inference queries over Big Data streams. To do so, we consider queries expressed as a positive fragment of LARS (a temporal logic framework based on Answer Set Programming) and propose solutions to process such queries, based on the two main execution models adopted by major parallel and distributed execution frameworks: Bulk Synchronous Parallel (BSP) and Record-at-A-Time (RAT). We implement our solution, named BigSR, and conduct a series of evaluations. Our experiments show that BigSR achieves high throughput, beyond a million triples per second, using a rather small cluster of machines.
Barbe, Philippe. "Ensemble d'information de marché et détermination des taux de change : l'apport des données d'enquête". Bordeaux 4, 1997. http://www.theses.fr/1997BOR40011.
The object of this thesis is to study the functioning of the foreign exchange market when agents do not know the "true model". We therefore reject the rational expectations hypothesis for the exchange rate. It is thus important to know how agents on the market form their information set. To answer this question, we adopt a methodology based on survey data and a concept of restricted rationality, economically rational expectations. Our results are the following. First of all, the market information set has two components, a fundamental component and a technical component. The first component is based on the analysis of economic indicators and drives the anticipation function when data are available on the market. The second component is important when fundamental information is not available. Furthermore, the analysis of market expectations shows that these variables are self-fulfilling in the short term and more heterogeneous in the long term.
Alam, Mehwish. "Découverte interactive de connaissances dans le web des données". Electronic Thesis or Diss., Université de Lorraine, 2015. http://www.theses.fr/2015LORR0158.
Recently, the “Web of Documents” has become the “Web of Data”, i.e., documents are annotated in the form of RDF, making this human-processable data directly processable by machines. This data can further be explored by the user using SPARQL queries. Just as web clustering engines classify the results obtained by querying the web of documents, a framework for classifying SPARQL query answers is needed to make sense of what is contained in the data. Exploratory Data Mining focuses on providing insight into the data. It also allows filtering out non-interesting parts of the data by directly involving the domain expert in the process. This thesis contributes to helping the user explore Linked Data with the help of exploratory data mining. We study three research directions: 1) creating views over RDF graphs and allowing user interaction over these views, 2) assessing the quality of RDF data and completing it, and 3) simultaneous navigation/exploration over multiple heterogeneous resources present on Linked Data. Firstly, we introduce a solution modifier, View By, to create views over RDF graphs by classifying SPARQL query answers with the help of Formal Concept Analysis (FCA). In order to navigate the obtained concept lattice and extract knowledge units, we develop a new tool called RV-Explorer (Rdf View eXplorer), which implements several navigational modes. However, this navigation/exploration reveals several gaps in the data sets. In order to complete the data, we use association rule mining. Furthermore, to provide navigation and exploration directly over RDF graphs along with background knowledge, RDF triples are clustered w.r.t. background knowledge, and these clusters can then be navigated and interactively explored. Finally, it can be concluded that instead of providing direct exploration, we use FCA as an aid for clustering RDF data, allowing the user to explore these clusters and to reduce the exploration space through interaction.
Dehainsala, Hondjack. "Explicitation de la sémantique dans les bases de données : Base de données à base ontologique et le modèle OntoDB". Phd thesis, Université de Poitiers, 2007. http://tel.archives-ouvertes.fr/tel-00157595.
[…] in terms of classes and properties, as well as the relations that link them. With the development of stable ontology models in different domains (OWL in the Semantic Web domain, PLIB in the technical domain), more and more data (or metadata) are described by reference to these ontologies. The growing size of such data makes it necessary to manage them within novel databases, which we call ontology-based databases (OBDBs; BDBO in French), and which have the particularity of representing, in addition to the data, the ontologies that define their meaning. Several OBDB architectures have been proposed in recent years. The schemas they use to represent data either consist of a single table of (subject, predicate, object) triples, or are split into unary and binary tables, respectively one for each class and one for each property. While such representations allow great flexibility in the structure of the represented data, they neither scale when each instance is described by a significant number of properties, nor fit the structure of conventional databases, which is based on n-ary relations. The OntoDB model aims to resolve this double drawback. By introducing typing assumptions that seem acceptable in many application domains, we propose an OBDB architecture made up of four parts. The first two correspond to the usual structure of databases: data resting on a logical data schema, and a meta-base describing the whole table structure. The other two parts, which are novel, represent respectively the ontologies and the ontology meta-model within a reflexive meta-schema. Abstraction and naming mechanisms make it possible, respectively, to associate with each data item the ontological concept that defines its meaning, and to access data from the concepts without worrying about how the data are represented. This architecture makes it possible both to manage efficiently large volumes of data defined by reference to ontologies (ontology-based data) and to index conventional databases at the knowledge level by adding to them the two parts: ontology and meta-schema. The proposed architecture model has been validated by the development of an operational prototype implemented on the PostgreSQL system with the PLIB ontology model. We also present a comparative evaluation of our proposals against previously presented models.
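The two storage layouts contrasted in this abstract (a single table of subject/predicate/object triples versus unary tables per class and binary tables per property) can be illustrated with a minimal Python sketch. This is not the OntoDB model itself; the `ex:` identifiers, the `decompose` helper and the use of `rdf:type` to mark class membership are assumptions made for illustration only.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed marker for class membership

# Layout 1: a single triple table, one (subject, predicate, object) row per fact.
# All identifiers below are hypothetical.
triples = [
    ("ex:p1", RDF_TYPE, "ex:Screw"),
    ("ex:p1", "ex:length", "12"),
    ("ex:p1", "ex:diameter", "4"),
    ("ex:p2", RDF_TYPE, "ex:Screw"),
    ("ex:p2", "ex:length", "20"),
]


def decompose(rows):
    """Layout 2: split the triple table into unary class tables
    (class -> subjects) and binary property tables
    (property -> (subject, object) pairs)."""
    class_tables = defaultdict(list)
    property_tables = defaultdict(list)
    for s, p, o in rows:
        if p == RDF_TYPE:
            class_tables[o].append(s)
        else:
            property_tables[p].append((s, o))
    return dict(class_tables), dict(property_tables)


class_tables, property_tables = decompose(triples)
```

In both layouts, rebuilding the full description of one instance means collecting one row (or one table lookup) per property, which gives some intuition for the scaling concern raised above when instances carry many properties.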
Farchi, Alban. "On the localisation of ensemble data assimilation methods". Thesis, Paris Est, 2019. http://www.theses.fr/2019PESC1034.
Data assimilation is the mathematical discipline which gathers all the methods designed to improve the knowledge of the state of a dynamical system using both observations and modelling results of this system. In the geosciences, data assimilation is mainly applied to numerical weather prediction. It has been used in operational centres for several decades, and it has significantly contributed to the increase in quality of the forecasts. Ensemble methods are powerful tools to reduce the dimension of data assimilation systems. Currently, the two most widespread classes of ensemble data assimilation methods are the ensemble Kalman filter (EnKF) and the particle filter (PF). The success of the EnKF in high-dimensional geophysical systems is largely due to the use of localisation. Localisation is based on the assumption that correlations between state variables in a dynamical system decrease at a fast rate with distance. In this thesis, we have studied and improved localisation methods for ensemble data assimilation. The first part is dedicated to the implementation of localisation in the PF. The recent developments in local particle filtering are reviewed, and a generic and theoretical classification of local PF algorithms is introduced, with an emphasis on the advantages and drawbacks of each category. Alongside the classification, practical solutions to the difficulties of local particle filtering are suggested. The local PF algorithms are tested and compared using twin experiments with low- to medium-order systems. Finally, we consider the case study of the prediction of tropospheric ozone using concentration measurements. Several data assimilation algorithms, including local PF algorithms, are applied to this problem and their performances are compared. The second part is dedicated to the implementation of covariance localisation in the EnKF. We show how covariance localisation can be efficiently implemented in the deterministic EnKF using an augmented ensemble. The proposed algorithm is tested using twin experiments with a medium-order model and satellite-like observations. Finally, the consistency of the deterministic EnKF with covariance localisation is studied in detail. A new implementation is proposed and compared to the original one using twin experiments with low-order models.
Khelil, Abdallah. "Gestion et optimisation des données massives issues du Web Combining graph exploration and fragmentation for scalable rdf query processing Should We Be Afraid of Querying Billions of Triples in a Graph-Based Centralized System? EXGRAF : Exploration et Fragmentation de Graphes au Service du Traitement Scalable de Requêtes RDF". Thesis, Chasseneuil-du-Poitou, Ecole nationale supérieure de mécanique et d'aérotechnique, 2020. http://www.theses.fr/2020ESMA0009.
Big Data represents a challenge not only for the socio-economic world but also for scientific research. Indeed, as has been pointed out in several scientific articles and strategic reports, modern computer applications are facing new problems and issues that are mainly related to the storage and exploitation of data generated by modern observation and simulation instruments. The management of such data represents a real bottleneck, which slows down the exploitation of the various data collected not only in the framework of international scientific programs but also by companies, the latter relying increasingly on the analysis of large-scale data. Much of this data is published today on the Web. Indeed, we are witnessing an evolution of the traditional web, designed basically to manage documents, into a web of data that offers mechanisms for querying semantic information. Several data models have been proposed to represent this information on the Web. The most important is the Resource Description Framework (RDF), which provides a simple and abstract representation of knowledge for resources on the Web. Each Semantic Web fact can be encoded as an RDF triple. In order to explore and query structured information expressed in RDF, several query languages have been proposed over the years. In 2008, SPARQL became the official W3C Recommendation language for querying RDF data. The need to efficiently manage and query RDF data has led to the development of new systems specifically designed to process this data format. These approaches can be categorized as centralized, relying on a single machine to manage RDF data, or distributed, combining multiple machines connected by a computer network. Some of these approaches are based on an existing data management system, such as Virtuoso and Jena; others rely on an approach specifically designed for the management of RDF triples, such as GRIN, RDF3X and gStore. With the evolution of RDF datasets (e.g. DBPedia) and SPARQL, most systems have become obsolete and/or inefficient. For example, no existing centralized system is able to manage the 1 billion triples provided under the WatDiv benchmark. Distributed systems would, under certain conditions, improve this point, but at the cost of a performance degradation. In this PhD thesis, we propose the centralized system "RDF_QDAG", which offers a good compromise between scalability and performance. We propose to combine physical data fragmentation and data graph exploration. "RDF_QDAG" supports multiple types of queries, based not only on basic graph patterns but also incorporating filters based on regular expressions as well as aggregation and sorting functions. "RDF_QDAG" relies on the Volcano execution model, which allows controlling the main memory and avoiding any overflow even if the hardware configuration is limited. To the best of our knowledge, "RDF_QDAG" is the only centralized system that maintains good performance when managing several billion triples. We compared this system with other systems that represent the state of the art in RDF data management: a relational approach (Virtuoso), a graph-based approach (g-Store), an intensive indexing approach (RDF-3X) and two parallel approaches (CliqueSquare and g-Store-D). "RDF_QDAG" surpasses existing systems when it comes to ensuring both scalability and performance.
Costabello, Luca. "Contrôle d'accès et présentation contextuelle pour le Web des données". Phd thesis, Université Nice Sophia Antipolis, 2013. http://tel.archives-ouvertes.fr/tel-00934617.
Leblay, Julien. "Techniques d'optimisation pour des données semi-structurées du web sémantique". Phd thesis, Université Paris Sud - Paris XI, 2013. http://tel.archives-ouvertes.fr/tel-00872883.
Delanaux, Rémy. "Intégration de données liées respectueuse de la confidentialité". Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1303.
Individual privacy is a major and largely unexplored concern when publishing new datasets in the context of Linked Open Data (LOD). The LOD cloud forms a network of interconnected and publicly accessible datasets in the form of graph databases modeled using the RDF format and queried using the SPARQL language. This heavily standardized context is nowadays extensively used by academics, public institutions and some private organizations to make their data available. Yet, some industrial and private actors may be discouraged by potential privacy issues. To this end, we introduce and develop a declarative framework for privacy-preserving Linked Data publishing in which privacy and utility constraints are specified as policies, that is, sets of SPARQL queries. Our approach is data-independent and only inspects the privacy and utility policies in order to determine the sequence of anonymization operations applicable to any graph instance for satisfying the policies. We prove the soundness of our algorithms and gauge their performance through experimental analysis. Another aspect to take into account is that a new dataset published to the LOD cloud is exposed to privacy breaches due to possible linkage to objects already existing in other LOD datasets. In the second part of this thesis, we thus focus on the problem of building safe anonymizations of an RDF graph, to guarantee that linking the anonymized graph with any external RDF graph will not cause privacy breaches. Given a set of privacy queries as input, we study the data-independent safety problem and the sequence of anonymization operations necessary to enforce it. We provide sufficient conditions under which an anonymization instance is safe given a set of privacy queries. Additionally, we show that our algorithms are robust in the presence of sameAs links, which can be explicit or inferred from additional knowledge. To conclude, we evaluate the impact of this safety-preserving solution on given input graphs through experiments. We focus on the performance and the utility loss of this anonymization framework on both real-world and artificial data. We first discuss and select utility measures to compare the original graph to its anonymized counterpart, then define a method to generate new privacy policies from a reference one by inserting incremental modifications. We study the behavior of the framework on four carefully selected RDF graphs. We show that our anonymization technique is effective, with reasonable runtime on quite large graphs (several million triples), and is gradual: the more specific the privacy policy, the smaller its impact. Finally, using structural graph-based metrics, we show that our algorithms are not very destructive even when privacy policies cover a large part of the graph. By designing a simple and efficient way to ensure privacy and utility in plausible usages of RDF graphs, this new approach suggests many extensions and, in the long run, more work on privacy-preserving data publishing in the context of Linked Open Data.
Pomorski, Denis. "Apprentissage automatique symbolique/numérique : construction et évaluation d'un ensemble de règles à partir des données". Lille 1, 1991. http://www.theses.fr/1991LIL10117.
Huang, Xin. "Querying big RDF data : semantic heterogeneity and rule-based inconsistency". Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB124/document.
The Semantic Web is the vision of the next generation of the Web proposed by Tim Berners-Lee in 2001. With the rapid development of Semantic Web technologies, large-scale RDF data already exist as linked open data, and their number is growing rapidly. Traditional Semantic Web querying and reasoning tools are designed to run in a stand-alone environment. Therefore, processing large-scale bulk data with traditional solutions inevitably results in bottlenecks of memory space and computational performance. Large volumes of heterogeneous data are collected from different data sources by different organizations. In this context, inconsistencies and uncertainties always exist between sources, and they are difficult to identify and evaluate. To address these challenges of the Semantic Web, the main research contents and innovative approaches proposed are as follows. We first developed an inference-based semantic entity resolution approach and linking mechanism for the case where the same entity is provided in multiple RDF resources described using different semantics and URI identifiers. We also developed a MapReduce-based rewriting engine for SPARQL queries over big RDF data, to handle the implicit data described intensionally by inference rules during query evaluation. The rewriting approach also deals with transitive closure and cyclic rules, to provide an inference language as rich as RDFS and OWL. The second contribution concerns distributed inconsistency processing. We extend the approach presented in the first contribution by taking into account inconsistency in the data. This includes: (1) rule-based inconsistency detection with the help of our query rewriting engine; (2) consistent query evaluation under three different semantics. The third contribution concerns reasoning and querying over large-scale uncertain RDF data. We propose a MapReduce-based approach to deal with large-scale reasoning with uncertainty. Unlike possible-worlds semantics, we propose an algorithm for generating an intensional SPARQL query plan over a probabilistic RDF graph, to compute the probabilities of the results within the query.
Gillani, Syed. "Semantically-enabled stream processing and complex event processing over RDF graph streams". Thesis, Lyon, 2016. http://www.theses.fr/2016LYSES055/document.
There is a paradigm shift in the nature and processing means of today’s data: data used to be mostly static and stored in large databases to be queried. Today, with the advent of new applications and means of collecting data, most applications on the Web and in enterprises produce data in a continuous manner in the form of streams. Thus, the users of these applications expect to process a large volume of data with fresh, low-latency results. This has resulted in the introduction of Data Stream Management Systems (DSMSs) and a Complex Event Processing (CEP) paradigm, both with distinctive aims: DSMSs are mostly employed to process traditional query operators (mostly stateless), while CEP systems focus on temporal pattern matching (stateful operators) to detect changes in the data that can be thought of as events. In the past decade or so, a number of scalable and performance-intensive DSMSs and CEP systems have been proposed. Most of them, however, are based on relational data models, which raises the question of support for heterogeneous data sources, i.e., variety of the data. Work in RDF stream processing (RSP) systems partly addresses the challenge of variety by promoting the RDF data model. Nonetheless, challenges like volume and velocity are overlooked by existing approaches. These challenges require customised optimisations which consider RDF as a first-class citizen and scale the process of continuous graph pattern matching. To gain insights into these problems, this thesis focuses on developing scalable RDF graph stream processing and semantically-enabled CEP systems (i.e., Semantic Complex Event Processing, SCEP). In addition to our optimised algorithmic and data structure methodologies, we also contribute to the design of a new query language for SCEP. Our contributions in these two fields are as follows:
• RDF Graph Stream Processing. We first propose an RDF graph stream model, where each data item/event within streams is comprised of an RDF graph (a set of RDF triples). Second, we implement customised indexing techniques and data structures to continuously process RDF graph streams in an incremental manner.
• Semantic Complex Event Processing. We extend the idea of RDF graph stream processing to enable SCEP over such RDF graph streams, i.e., temporal pattern matching. Our first contribution in this context is to provide a new query language that encompasses the RDF graph stream model and employs a set of expressive temporal operators such as sequencing, kleene-+, negation, optional, conjunction, disjunction and event selection strategies. Based on this, we implement a scalable system that employs a non-deterministic finite automata model to evaluate these operators in an optimised manner.
We leverage techniques from diverse fields, such as relational query optimisations, incremental query processing, and sensor and social networks, in order to solve real-world problems. We have applied our proposed techniques to a wide range of real-world and synthetic datasets to extract knowledge from RDF structured data in motion. Our experimental evaluations confirm our theoretical insights and demonstrate the viability of our proposed methods.
Faucon, Jean-Christophe. "Etudes statistiques et des relations structure-écotoxicité appliquées aux données écotoxicologiques d'un ensemble hétérogène de substances nouvelles". Caen, 1998. http://www.theses.fr/1998CAEN4002.
Raharjo, Agus Budi. "Reliability in ensemble learning and learning from crowds". Electronic Thesis or Diss., Aix-Marseille, 2019. http://www.theses.fr/2019AIXM0606.
The combination of several human expert labels is generally used to make reliable decisions. However, using humans or learning systems to improve the overall decision is a crucial problem. Indeed, human experts or machine learning systems do not necessarily have the same performance. Hence, a great effort is made to deal with this performance problem in the presence of several actors, i.e., humans or classifiers. In this thesis, we present the combination of reliable classifiers in ensemble learning and learning from crowds. The first contribution is a method, based on weighted voting, which allows selecting a reliable combination of classifications. Our algorithm RelMV transforms confidence scores, obtained during the training phase, into reliable scores. By using these scores, it determines a set of reliable candidates through both static and dynamic selection processes. When it is hard to find expert labels as ground truth, we propose, as our second contribution, an approach based on Bayesian inference and expectation-maximization (EM). The aim is to evaluate the reliability degree of each annotator and to aggregate the appropriate labels carefully. We optimize the computation time of the algorithm in order to adapt it to large amounts of data collected from crowds. The obtained outcomes show better accuracy, stability, and computation time compared to previous methods. We also conduct an experiment on the melanoma diagnosis problem using a real-world medical dataset consisting of a set of skin lesion images annotated by multiple dermatologists.
Galarraga, Del Prado Luis. "Extraction des règles d'association dans des bases de connaissances". Thesis, Paris, ENST, 2016. http://www.theses.fr/2016ENST0050/document.
The continuous progress of information extraction (IE) techniques has led to the construction of large general-purpose knowledge bases (KBs). These KBs contain millions of computer-readable facts about real-world entities such as people, organizations and places. KBs are important nowadays because they allow computers to “understand” the real world. They are used in multiple applications in Information Retrieval, Query Answering and Automatic Reasoning, among other fields. Furthermore, the plethora of information available in today’s KBs allows for the discovery of frequent patterns in the data, a task known as rule mining. Such patterns or rules convey useful insights about the data. These rules can be used in several applications ranging from data analytics and prediction to data maintenance tasks. The contribution of this thesis is twofold. First, it proposes a method to mine rules on KBs. The method relies on a mining model tailored for potentially incomplete web-extracted KBs. Second, the thesis shows the applicability of rule mining in several data-oriented tasks in KBs, namely fact prediction, schema alignment, canonicalization of (open) KBs and prediction of completeness.
Nadal, Robert. "Analyse des données astronomiques contenues dans le "Commentaire" d'Hipparque : [thèse en partie soutenue sur un ensemble de travaux]". Toulouse 3, 1990. http://www.theses.fr/1990TOU30197.
Feng, Wei. "Investigation of training data issues in ensemble classification based on margin concept : application to land cover mapping". Thesis, Bordeaux 3, 2017. http://www.theses.fr/2017BOR30016/document.
Classification has been widely studied in machine learning. Ensemble methods, which build a classification model by integrating multiple component learners, achieve higher performance than a single classifier. The classification accuracy of an ensemble is directly influenced by the quality of the training data used. However, real-world data often suffer from class noise and class imbalance problems. The ensemble margin is a key concept in ensemble learning. It has been applied to both the theoretical analysis and the design of machine learning algorithms. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. This work focuses on exploiting the margin concept to improve the quality of the training set, and therefore to increase the classification accuracy of noise-sensitive classifiers, and to design effective ensemble classifiers that can handle imbalanced datasets. A novel ensemble margin definition is proposed. It is an unsupervised version of a popular ensemble margin: it does not involve the class labels. Mislabeled training data is a challenge to face in order to build a robust classifier, whether it is an ensemble or not. To handle the mislabeling problem, we propose an ensemble margin-based class noise identification and elimination method based on an existing margin-based class noise ordering. This method can achieve a high mislabeled instance detection rate while keeping the false detection rate as low as possible. It relies on the margin values of misclassified data, considering four different ensemble margins, including the novel proposed margin. This method is extended to tackle class noise correction, which is a more challenging issue. The instances with low margins are more important than safe samples, which have high margins, for building a reliable classifier. A novel bagging algorithm, based on a data importance evaluation function that again relies on the ensemble margin, is proposed to deal with the class imbalance problem. In our algorithm, the emphasis is placed on the lowest-margin samples. This method is again evaluated using four different ensemble margins in addressing the imbalance problem, especially on multi-class imbalanced data. In remote sensing, where training data are typically ground-based, mislabeled training data are inevitable. Imbalanced training data is another problem frequently encountered in remote sensing. Both proposed ensemble methods, using the best margin definition for handling these two major training data issues, are applied to the mapping of land covers.
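The abstract above does not spell out its margin formulas, so the following is only a minimal Python sketch of one commonly used pair of definitions, given as an assumption: a supervised margin that compares the votes for the true class with the best competing class, and a label-free variant that compares the two most voted classes. The function names and the land-cover labels are hypothetical.

```python
from collections import Counter


def supervised_margin(votes, true_label):
    """(votes for the true class - most votes for any other class) / total votes.

    Ranges from -1 (all learners agree on a wrong class) to +1
    (all learners agree on the true class)."""
    counts = Counter(votes)
    v_true = counts.get(true_label, 0)
    v_other = max((c for label, c in counts.items() if label != true_label), default=0)
    return (v_true - v_other) / len(votes)


def unsupervised_margin(votes):
    """(votes for the top class - votes for the runner-up) / total votes.

    Needs no class label; ranges from 0 (tie) to 1 (unanimous)."""
    ranked = [c for _, c in Counter(votes).most_common(2)]
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0
    return (top - second) / len(votes)


# Hypothetical votes of five base classifiers for one pixel.
votes = ["water", "water", "forest", "water", "urban"]
m_true = supervised_margin(votes, "water")
m_wrong = supervised_margin(votes, "forest")
m_unsup = unsupervised_margin(votes)
```

Under these assumed definitions, a sample whose recorded label disagrees with the ensemble gets a negative supervised margin, which is the kind of signal a margin-based noise filter could rank on, while the label-free variant only measures how decisive the vote was.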
Galárraga, Del Prado Luis. "Extraction des règles d'association dans des bases de connaissances". Electronic Thesis or Diss., Paris, ENST, 2016. http://www.theses.fr/2016ENST0050.
The continuous progress of information extraction (IE) techniques has led to the construction of large general-purpose knowledge bases (KBs). These KBs contain millions of computer-readable facts about real-world entities such as people, organizations and places. KBs are important nowadays because they allow computers to “understand” the real world. They are used in multiple applications in Information Retrieval, Query Answering and Automatic Reasoning, among other fields. Furthermore, the plethora of information available in today’s KBs allows for the discovery of frequent patterns in the data, a task known as rule mining. Such patterns or rules convey useful insights about the data. These rules can be used in several applications ranging from data analytics and prediction to data maintenance tasks. The contribution of this thesis is twofold. First, it proposes a method to mine rules on KBs. The method relies on a mining model tailored for potentially incomplete web-extracted KBs. Second, the thesis shows the applicability of rule mining in several data-oriented tasks in KBs, namely fact prediction, schema alignment, canonicalization of (open) KBs and prediction of completeness.
Cao, Tien Duc. "Toward Automatic Fact-Checking of Statistic Claims". Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLX051/document.
Digital content is increasingly produced nowadays in a variety of media such as news and social network sites, personal Web sites, blogs etc. In particular, a large and dynamic part of such content is related to media-worthy events, whether of general interest (e.g., the war in Syria) or of specialized interest to a sub-community of users (e.g., sport events or genetically modified organisms). While such content is primarily meant for human users (readers), interest is growing in its automatic analysis, understanding and exploitation. Within the ANR project ContentCheck, we are interested in developing textual and semantic tools for analyzing content shared through digital media. The proposed PhD project takes place within this contract, and will be developed based on interactions with our partner from Le Monde. The PhD project aims at developing algorithms and tools for:
• Classifying and annotating mixed content (from articles, structured databases, social media etc.) based on an existing set of topics (or ontology);
• Information and relation extraction from a text which may comprise a statement to be fact-checked, with a particular focus on capturing the time dimension; a sample statement is for instance « VAT on iron in France was the highest in Europe in 2015 »;
• Building structured queries from extracted information and relations, to be evaluated against reference databases used as trusted information against which facts can be checked.
Symeonidou, Danai. "Automatic key discovery for Data Linking". Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112265/document.
In recent years, the Web of Data has grown significantly and now contains a huge number of RDF triples. Integrating data described in different RDF datasets and creating semantic links among them has become one of the most important goals of RDF applications. These links express semantic correspondences between ontology entities or data. Among the different kinds of semantic links that can be established, identity links express that different resources refer to the same real-world entity. By comparing the number of resources published on the Web with the number of identity links, one can observe that the goal of building a Web of data is still not accomplished. Several data linking approaches infer identity links using keys. Nevertheless, in most datasets published on the Web, the keys are not available and it can be difficult, even for an expert, to declare them. The aim of this thesis is to study the problem of automatic key discovery in RDF data and to propose new efficient approaches to tackle this problem. Data published on the Web are usually created automatically, and thus may contain erroneous information or duplicates, or may be incomplete. Therefore, we focus on developing key discovery approaches that can handle datasets with numerous, incomplete or erroneous data. Our objective is to discover as many keys as possible, even ones that are valid only in subparts of the data. We first introduce KD2R, an approach that allows the automatic discovery of composite keys in RDF datasets that may conform to different schemas. KD2R is able to treat datasets that may be incomplete and for which the Unique Name Assumption is fulfilled. To deal with the incompleteness of data, KD2R proposes two heuristics that offer different interpretations for the absence of data. KD2R uses pruning techniques to reduce the search space. However, this approach is overwhelmed by the huge amount of data found on the Web. Thus, we present our second approach, SAKey, which is able to scale to very large datasets by using effective filtering and pruning techniques. Moreover, SAKey is capable of discovering keys in datasets where erroneous data or duplicates may exist. More precisely, the notion of almost keys is proposed to describe sets of properties that are not keys due to a few exceptions.
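The notion of almost keys mentioned at the end of this abstract can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the SAKey algorithm: properties are treated as single-valued, an exception is taken to be an instance that shares all its values on the property set with at least one other instance, a property set counts as an n-almost key when it has at most n exceptions, and instances lacking one of the properties are skipped. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def exceptions(instances, properties):
    """Return the instances that share all their values on `properties`
    with at least one other instance."""
    groups = defaultdict(list)
    for name, description in instances.items():
        # Assumption: an instance with a missing property cannot be an exception.
        if all(p in description for p in properties):
            signature = tuple(description[p] for p in properties)
            groups[signature].append(name)
    return {name for members in groups.values() if len(members) > 1 for name in members}


def is_almost_key(instances, properties, n):
    """Treat `properties` as an n-almost key if it has at most n exceptions."""
    return len(exceptions(instances, properties)) <= n


# Hypothetical dataset: p1 and p4 look like duplicates of each other.
people = {
    "p1": {"name": "Ann", "city": "Paris"},
    "p2": {"name": "Ann", "city": "Lyon"},
    "p3": {"name": "Bob", "city": "Paris"},
    "p4": {"name": "Ann", "city": "Paris"},
}
name_exceptions = exceptions(people, ["name"])
pair_exceptions = exceptions(people, ["name", "city"])
```

In this toy dataset, `["name", "city"]` is not a strict key because of the duplicate pair, but it would be accepted once two exceptions are tolerated, which is the intuition behind discovering keys in data that contain errors or duplicates.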
Létourneau, François. "Analyse du potentiel de l'approche entrepôt de données pour l'intégration des métadonnées provenant d'un ensemble de géorépertoires disponibles sur Internet". Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk2/tape17/PQDD_0007/MQ31752.pdf.
Le, Brun Alexia. "Etude d'un ensemble de paramètres liés à la sécheresse de la peau : traitement des données par des méthodes d'analyses multidimensionnelles". Bordeaux 1, 1986. http://www.theses.fr/1986BOR10880.
LEBRUN, ALEXIA MARIE. "Etude d'un ensemble de paramètres liés a la sécheresse de la peau : traitements des données par des méthodes d'analyses multidimensionnelles". Bordeaux 1, 1986. http://www.theses.fr/1986BOR10885.
Le, Brun Alexia. "Étude d'un ensemble de paramètres liés à la sécheresse de la peau : traitements des données par des méthodes d'analyses multidimensionnelles". Bordeaux 1, 1986. http://www.theses.fr/1986BOR10689.
Séraphin, John. "Réalisation d'un intranet : cohérence d'un ensemble réparti et communicant, autour d'une architecture réflexive". Paris 5, 1998. http://www.theses.fr/1998PA05S007.