Dissertations / Theses on the topic 'Linked Data Quality'

To see the other types of publications on this topic, follow the link: Linked Data Quality.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 20 dissertations / theses for your research on the topic 'Linked Data Quality.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

SPAHIU, BLERINA. "Profiling Linked Data." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2017. http://hdl.handle.net/10281/151645.

Full text
Abstract:
Recently, the increasing diffusion of Linked Data (LD) as a standard way to publish and structure data on the Web has received growing attention from researchers and data publishers. LD adoption is reflected in different domains such as government, media and life science, building a powerful Web available to anyone. Despite the high number of datasets published as LD, their usage is still not fully exploited because they lack comprehensive metadata. Data consumers need to obtain information about dataset content in a fast and summarized form to decide whether it is useful for the use case at hand. Data profiling techniques offer an efficient solution to this problem, as they are used to generate metadata and statistics that describe the content of a dataset. Existing profiling techniques do not cover a wide range of use cases, and many challenges due to the heterogeneous nature of Linked Data are still to be overcome. This thesis presents doctoral research that tackles the problems related to profiling Linked Data. Even though data profiling is an umbrella term for diverse descriptive information about a dataset, in this thesis we cover three aspects of profiling: topic-based, schema-based and linkage-based. The profile provided in this thesis is fundamental for the decision-making process and is the basic requirement towards dataset understanding. We present an approach to automatically classify datasets into one of the topical categories used in the LD cloud. Moreover, we investigate the problem of multi-topic profiling. For schema-based profiling we propose a schema-based summarization approach that provides an overview of the relations in the data. Our summaries are concise and informative enough to summarize the whole dataset; moreover, they reveal quality issues and can help users in query formulation tasks. Many datasets in the LD cloud contain similar information for the same entity; in order to fully exploit its potential, LD should make this information explicit. Linkage profiling provides information about the number of equivalent entities between datasets and reveals possible errors. The profiling techniques developed during this work are automatic and can be applied to different datasets independently of the domain.
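By way of illustration, here is a minimal Python sketch of a schema-level profile in the spirit described above (toy triples and made-up URIs, not taken from the thesis). It counts (subject type, predicate, object type) patterns as a compact summary of the relations in a dataset:

```python
from collections import Counter

# Toy triples; URIs are invented for the example.
triples = [
    ("ex:Inception", "rdf:type", "dbo:Film"),
    ("ex:Nolan", "rdf:type", "dbo:Person"),
    ("ex:Inception", "dbo:director", "ex:Nolan"),
    ("ex:Memento", "rdf:type", "dbo:Film"),
    ("ex:Memento", "dbo:director", "ex:Nolan"),
]

# Map each resource to its declared type.
types = {s: o for s, p, o in triples if p == "rdf:type"}

def schema_summary(triples):
    """Count (subject type, predicate, object type) patterns as a profile."""
    patterns = Counter()
    for s, p, o in triples:
        if p == "rdf:type":
            continue
        patterns[(types.get(s, "?"), p, types.get(o, "?"))] += 1
    return patterns

for pattern, count in schema_summary(triples).items():
    print(pattern, count)
```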
APA, Harvard, Vancouver, ISO, and other styles
2

Issa, Subhi. "Linked data quality : completeness and conciseness." Electronic Thesis or Diss., Paris, CNAM, 2019. http://www.theses.fr/2019CNAM1274.

Full text
Abstract:
The wide spread of Semantic Web technologies such as the Resource Description Framework (RDF) enables individuals to build their databases on the Web, to write vocabularies, and to define rules to arrange and explain the relationships between data according to the Linked Data principles. As a consequence, a large amount of structured and interlinked data is being generated daily. A close examination of the quality of this data could be very critical, especially if important research and professional decisions depend on it. The quality of Linked Data is an important aspect to indicate their fitness for use in applications. Several dimensions to assess the quality of Linked Data are identified, such as accuracy, completeness, provenance, and conciseness. This thesis focuses on assessing completeness and enhancing conciseness of Linked Data. In particular, we first proposed a completeness calculation approach based on a generated schema. Indeed, as a reference schema is required to assess completeness, we proposed a mining-based approach to derive a suitable schema (i.e., a set of properties) from data. This approach distinguishes between essential properties and marginal ones to generate, for a given dataset, a conceptual schema that meets the user's expectations regarding data completeness constraints. We implemented a prototype called "LOD-CM" to illustrate the process of deriving a conceptual schema of a dataset based on the user's requirements. We further proposed an approach to discover equivalent predicates to improve the conciseness of Linked Data. This approach is based, in addition to a statistical analysis, on a deep semantic analysis of data and on learning algorithms. We argue that studying the meaning of predicates can help to improve the accuracy of results. Finally, a set of experiments was conducted on real-world datasets to evaluate our proposed approaches.
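As a hedged illustration of the completeness idea summarised above (a sketch with toy data and an arbitrary support threshold, not the thesis' LOD-CM implementation), one can mine a reference schema as the set of frequent properties and score each entity against it:

```python
from collections import Counter

# Toy data: each entity maps to the set of properties it uses.
entities = {
    "ex:Film1": {"rdfs:label", "dbo:director", "dbo:runtime"},
    "ex:Film2": {"rdfs:label", "dbo:director"},
    "ex:Film3": {"rdfs:label", "dbo:starring"},
}

def mine_schema(entities, min_support=0.5):
    """Keep properties used by at least `min_support` of the entities."""
    counts = Counter(p for props in entities.values() for p in props)
    n = len(entities)
    return {p for p, c in counts.items() if c / n >= min_support}

def completeness(props, schema):
    """Fraction of the reference schema covered by one entity."""
    return len(props & schema) / len(schema) if schema else 1.0

schema = mine_schema(entities, min_support=0.5)
for uri, props in entities.items():
    print(uri, round(completeness(props, schema), 2))
print("dataset completeness:",
      round(sum(completeness(p, schema) for p in entities.values()) / len(entities), 2))
```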
APA, Harvard, Vancouver, ISO, and other styles
3

RULA, ANISA. "Time-related quality dimensions in linked data." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2014. http://hdl.handle.net/10281/81717.

Full text
Abstract:
Over the last few years, there has been an increasing diffusion of Linked Data as a standard way to publish interlinked structured data on the Web, which allows users and public and private organizations to fully exploit a large amount of data from several domains that were not available in the past. Although gathering and publishing such massive amounts of structured data is certainly a step in the right direction, quality still poses a significant obstacle to the uptake of data consumption applications at large scale. A crucial aspect of quality regards the dynamic nature of Linked Data, where information can change rapidly and fail to reflect changes in the real world, thus becoming out of date. Quality is characterised by different dimensions that capture several aspects of quality such as accuracy, currency, consistency or completeness. In particular, the aspects of Linked Data dynamicity are captured by Time-Related Quality Dimensions such as data currency. The assessment of Time-Related Quality Dimensions, which is the task of measuring the quality, is based on temporal information whose collection poses several challenges regarding its availability, representation and diversity in Linked Data. The assessment of Time-Related Quality Dimensions supports data consumers in their decisions on whether information is valid or not. The main goal of this thesis is to develop techniques for assessing Time-Related Quality Dimensions in Linked Data, which must overcome several challenges posed by Linked Data such as third-party applications, variety of data, high volume of data or velocity of data. The major contributions of this thesis can be summarized as follows: it presents a general set of definitions for quality dimensions and measures adopted in Linked Data; it provides a large-scale analysis of approaches for representing temporal information in Linked Data; it provides a sharable and interoperable conceptual model which integrates vocabularies used to represent the temporal information required for the assessment of Time-Related Quality Dimensions; it proposes two domain-independent techniques to assess data currency that work with incomplete or inaccurate temporal information; and finally it provides an approach that enriches information with time intervals representing their temporal validity.
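To make the notion of data currency concrete, the following sketch uses one common age-based formulation (the volatility window and dates are assumptions, not the measures defined in the thesis):

```python
from datetime import datetime, timezone

def currency(last_modified: datetime, observation_time: datetime,
             volatility_days: float) -> float:
    """Age-based currency in [0, 1]: 1.0 means just updated, 0.0 means the
    value is older than the assumed volatility window."""
    age_days = (observation_time - last_modified).total_seconds() / 86400
    return max(0.0, 1.0 - age_days / volatility_days)

now = datetime(2014, 6, 1, tzinfo=timezone.utc)
print(currency(datetime(2014, 5, 20, tzinfo=timezone.utc), now, volatility_days=30))
print(currency(datetime(2013, 1, 1, tzinfo=timezone.utc), now, volatility_days=30))
```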
APA, Harvard, Vancouver, ISO, and other styles
4

Debattista, Jeremy [Verfasser]. "Scalable Quality Assessment of Linked Data / Jeremy Debattista." Bonn : Universitäts- und Landesbibliothek Bonn, 2017. http://d-nb.info/1135663440/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Baillie, Chris. "Reasoning about quality in the Web of Linked Data." Thesis, University of Aberdeen, 2015. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?pid=227177.

Full text
Abstract:
In recent years the Web has evolved from a collection of hyperlinked documents to a vast ecosystem of interconnected documents, devices, services, and agents. However, the open nature of the Web enables anyone or any thing to publish any content they choose. Therefore, poor-quality data can quickly propagate, and an appropriate mechanism to assess the quality of such data is essential if agents are to identify reliable information for use in decision-making. Existing assessment frameworks investigate the context around data (additional information that describes the situation in which a datum was created). Such metadata can be made available by publishing information to the Web of Linked Data. However, there are situations in which examining context alone is not sufficient, such as when one must identify the agent responsible for data creation, or the transformational processes applied to data. In these situations, examining data provenance is critical to identifying quality issues. Moreover, there will be situations in which an agent is unable to perform a quality assessment of its own, for example if the original contextual metadata is no longer available. Here, it may be possible for agents to explore the provenance of previous quality assessments and make decisions about quality result re-use. This thesis explores issues around quality assessment and provenance in the Web of Linked Data. It contributes a formal model of quality assessment designed to align with emerging standards for provenance on the Web. This model is then realised as an OWL ontology, which can be used as part of a software framework to perform data quality assessment. Through a number of real-world examples, spanning environmental sensing, invasive species monitoring, and passenger information domains, the thesis establishes the importance of examining provenance as part of quality assessment. Moreover, it demonstrates that by examining quality assessment provenance, agents can make re-use decisions about existing quality assessment results. Included in these implementations are sets of example quality metrics that demonstrate how these can be encoded using the SPARQL Inferencing Notation (SPIN).
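As a rough sketch of the re-use decision described above (hypothetical field names and thresholds, not the thesis' OWL model), an agent could check the provenance of a previous assessment before trusting its result:

```python
from datetime import datetime, timezone, timedelta

# A previous quality assessment with minimal provenance (who, when, metric).
previous = {
    "metric": "completeness",
    "score": 0.92,
    "agent": "ex:sensorQualityService",
    "generated_at": datetime(2015, 3, 1, tzinfo=timezone.utc),
}

def can_reuse(assessment, trusted_agents, max_age, now):
    """Re-use a past result only if its provenance names a trusted agent
    and the result is recent enough."""
    return (assessment["agent"] in trusted_agents
            and now - assessment["generated_at"] <= max_age)

now = datetime(2015, 3, 10, tzinfo=timezone.utc)
print(can_reuse(previous, {"ex:sensorQualityService"}, timedelta(days=30), now))
```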
APA, Harvard, Vancouver, ISO, and other styles
6

Zaveri, Amrapali. "Linked Data Quality Assessment and its Application to Societal Progress Measurement." Doctoral thesis, Universitätsbibliothek Leipzig, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-167021.

Full text
Abstract:
In recent years, the Linked Data (LD) paradigm has emerged as a simple mechanism for employing the Web as a medium for data and knowledge integration where both documents and data are linked. Moreover, the semantics and structure of the underlying data are kept intact, making this the Semantic Web. LD essentially entails a set of best practices for publishing and connecting structured data on the Web, which allows publishing and exchanging information in an interoperable and reusable fashion. Many different communities on the Internet such as geographic, media, life sciences and government have already adopted these LD principles. This is confirmed by the dramatically growing Linked Data Web, where currently more than 50 billion facts are represented. With the emergence of the Web of Linked Data, there are several use cases, which are possible due to the rich and disparate data integrated into one global information space. Linked Data, in these cases, not only assists in building mashups by interlinking heterogeneous and dispersed data from multiple sources but also empowers the uncovering of meaningful and impactful relationships. These discoveries have paved the way for scientists to explore the existing data and uncover meaningful outcomes that they might not have been aware of previously. In all these use cases utilizing LD, one crippling problem is the underlying data quality. Incomplete, inconsistent or inaccurate data affects the end results gravely, thus making them unreliable. Data quality is commonly conceived as fitness for use, be it for a certain application or use case. There are cases when datasets that contain quality problems are still useful for certain applications, thus depending on the use case at hand. Thus, LD consumption has to deal with the problem of getting the data into a state in which it can be exploited for real use cases. The insufficient data quality can be caused either by the LD publication process or is intrinsic to the data source itself. A key challenge is to assess the quality of datasets published on the Web and make this quality information explicit. Assessing data quality is particularly a challenge in LD as the underlying data stems from a set of multiple, autonomous and evolving data sources. Moreover, the dynamic nature of LD makes assessing the quality crucial to measure the accuracy of representing the real-world data. On the document Web, data quality can only be indirectly or vaguely defined, but there is a requirement for more concrete and measurable data quality metrics for LD. Such data quality metrics include correctness of facts with respect to the real world, adequacy of semantic representation, quality of interlinks, interoperability, timeliness or consistency with regard to implicit information. Even though data quality is an important concept in LD, there are few methodologies proposed to assess the quality of these datasets. Thus, in this thesis, we first unify 18 data quality dimensions and provide a total of 69 metrics for assessment of LD. The first methodology includes the employment of LD experts for the assessment. This assessment is performed with the help of the TripleCheckMate tool, which was developed specifically to assist LD experts in assessing the quality of a dataset, in this case DBpedia. The second methodology is a semi-automatic process, in which the first phase involves the detection of common quality problems by the automatic creation of an extended schema for DBpedia. The second phase involves the manual verification of the generated schema axioms. Thereafter, we employ the wisdom of the crowd, i.e. workers of online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk), to assess the quality of DBpedia. We then compare the two approaches (previous assessment by LD experts and assessment by MTurk workers in this study) in order to measure the feasibility of each type of user-driven data quality assessment methodology. Additionally, we evaluate another semi-automated methodology for LD quality assessment, which also involves human judgement. In this semi-automated methodology, selected metrics are formally defined and implemented as part of a tool, namely R2RLint. The user is not only provided with the results of the assessment but also with the specific entities that cause the errors, which helps users understand the quality issues and thus fix them. Finally, we take into account a domain-specific use case that consumes LD and leverages data quality. In particular, we identify four LD sources, assess their quality using the R2RLint tool and then utilize them in building the Health Economic Research (HER) Observatory. We show the advantages of this semi-automated assessment over the other types of quality assessment methodologies discussed earlier. The Observatory aims at evaluating the impact of research development on the economic and healthcare performance of each country per year. We illustrate the usefulness of LD in this use case and the importance of quality assessment for any data analysis.
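For illustration only, a toy metric in the spirit of tools such as R2RLint might return both a score and the offending entities; the triples, predicates and checks below are assumptions, not metrics from the thesis:

```python
import re

# Toy triple store: (subject, predicate, object) strings.
triples = [
    ("ex:Alice", "rdf:type", "foaf:Person"),
    ("ex:Bob", "rdf:type", "foaf:Person"),
    ("ex:Alice", "foaf:age", "thirty"),  # not a valid integer literal
    ("ex:Bob", "foaf:age", "42"),
    ("ex:Chart", "foaf:age", "41"),      # untyped subject
]

def assess(triples, numeric_predicates=("foaf:age",)):
    """Return a score in [0, 1] plus the triples that caused violations."""
    typed = {s for s, p, _ in triples if p == "rdf:type"}
    violations = []
    for s, p, o in triples:
        problems = []
        if p in numeric_predicates and not re.fullmatch(r"-?\d+", o):
            problems.append("non-numeric literal")
        if s not in typed:
            problems.append("untyped subject")
        if problems:
            violations.append((s, p, o, problems))
    score = 1.0 - len(violations) / len(triples)
    return score, violations

score, violations = assess(triples)
print(round(score, 2))
for v in violations:
    print(v)
```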
APA, Harvard, Vancouver, ISO, and other styles
7

YAMAN, BEYZA. "Exploiting Context-Dependent Quality Metadata for Linked Data Source Selection." Doctoral thesis, Università degli studi di Genova, 2018. http://hdl.handle.net/11567/930633.

Full text
Abstract:
The traditional Web is evolving into the Web of Data, which consists of huge collections of structured data over poorly controlled distributed data sources. Live queries are needed to get current information out of this global data space. In live query processing, source selection deserves attention since it allows us to identify the sources which are likely to contain the relevant data. The thesis proposes a source selection technique in the context of live query processing on Linked Open Data which takes into account the context of the request and the quality of the data contained in the sources, in order to enhance the relevance (since the context enables a better interpretation of the request) and the quality of the answers (which will be obtained by processing the request on the selected sources). Specifically, the thesis proposes an extension of the QTree indexing structure, which had been proposed as a data summary to support source selection based on source content, to take into account quality and contextual information. With reference to a specific case study, the thesis also contributes an approach, relying on the Luzzu framework, to assess the quality of a source with respect to a given context (according to different quality dimensions). An experimental evaluation of the proposed techniques is also provided.
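A minimal sketch of context- and quality-aware source selection (invented sources, scores and weighting, not the QTree extension itself) could rank sources by blending content relevance with context-specific quality:

```python
# Candidate sources with a content-relevance estimate (e.g. from a data
# summary) and per-context quality scores; all values are invented.
sources = {
    "src:geo":  {"relevance": 0.8, "quality": {"tourism": 0.9, "logistics": 0.4}},
    "src:stat": {"relevance": 0.7, "quality": {"tourism": 0.3, "logistics": 0.8}},
    "src:blog": {"relevance": 0.2, "quality": {"tourism": 0.6, "logistics": 0.5}},
}

def select(sources, context, k=2, alpha=0.5):
    """Rank sources by a blend of content relevance and context-specific quality."""
    def score(meta):
        return alpha * meta["relevance"] + (1 - alpha) * meta["quality"].get(context, 0.0)
    return sorted(sources, key=lambda s: score(sources[s]), reverse=True)[:k]

print(select(sources, "tourism"))
print(select(sources, "logistics"))
```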
APA, Harvard, Vancouver, ISO, and other styles
8

Salibekyan, Zinaida. "Trends in job quality : evidence from French and British linked employer-employee data." Thesis, Aix-Marseille, 2016. http://www.theses.fr/2016AIXM2001.

Full text
Abstract:
The contribution of this thesis is to examine the evolution of job quality from the perspective of the workplace, using the British Workplace Employment Relations Surveys (WERS 2004 and 2011) and the French Enquête Relations Professionnelles et Négociations d’Entreprises (REPONSE 2005 and 2011). The thesis consists of three chapters and complements three main strands of the existing literature. The first chapter explores the impact of workplace adjustment practices on job quality in France during the recession. The second chapter examines the role of institutional regimes in Great Britain and France in explaining the cross-national variation in job quality. Finally, the third chapter investigates the strategies employees adopt in order to cope with their pay and working conditions.
APA, Harvard, Vancouver, ISO, and other styles
9

Melo, Jessica Oliveira de Souza Ferreira [UNESP]. "Metodologia de avaliação de qualidade de dados no contexto do linked data." Universidade Estadual Paulista (UNESP), 2017. http://hdl.handle.net/11449/150870.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
The Semantic Web suggests the use of standards and technologies that assign structure and semantics to data, so that computational agents can perform intelligent, automatic processing to accomplish specific tasks. In this context, the Linked Open Data (LOD) project was created, an initiative to promote the publication of Linked Data. With the evident growth of data published as Linked Data, quality has become essential for such datasets to meet the basic goals of the Semantic Web. This is because quality problems in published datasets are a hindrance not only to their use but also to applications that make use of such data. Considering that data made available as Linked Data enable a favorable environment for intelligent applications, quality problems can also hinder or prevent the integration of data coming from different datasets. The literature applies several quality dimensions in the context of Linked Data; however, the applicability of such dimensions to the quality evaluation of linked data is called into question. Thus, this research aims to propose a methodology for quality evaluation of Linked Data datasets, as well as to establish a model of what can be considered data quality in the context of the Semantic Web and Linked Data. To this end, an exploratory and descriptive approach was adopted to establish quality problems, dimensions and requirements, and quantitative methods were used in the evaluation methodology to assign quality indexes. This work resulted in the definition of seven quality dimensions applicable to the Linked Data domain and 14 different formulas for quantifying the quality of datasets about scientific publications. Finally, a proof of concept was carried out in which the proposed quality assessment methodology was applied to a dataset promoted by the LOD project. The proof-of-concept results show that the proposed methodology is a viable means of quantifying quality problems in Linked Data datasets, and that despite the various requirements for publishing this type of data, there may be datasets that do not meet certain quality requirements and, in turn, should not be included in the LOD project diagram.
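As a loose illustration of turning per-dimension measurements into a single index (the dimensions, scores and weights below are placeholders, not the 14 formulas defined in the dissertation):

```python
# Hypothetical per-dimension scores for a dataset (each in [0, 1]) and
# weights reflecting how much each dimension matters for the use case.
scores = {"completeness": 0.8, "conciseness": 0.9, "consistency": 0.6}
weights = {"completeness": 0.5, "conciseness": 0.2, "consistency": 0.3}

def quality_index(scores, weights):
    """Weighted aggregation of dimension scores into a single quality index."""
    total = sum(weights.values())
    return sum(scores[d] * w for d, w in weights.items()) / total

print(round(quality_index(scores, weights), 3))
```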
APA, Harvard, Vancouver, ISO, and other styles
10

Zaveri, Amrapali [Verfasser], and Felix [Gutachter] Naumann. "Linked Data Quality Assessment and its Application to Societal Progress Measurement / Amrapali Zaveri ; Gutachter: Felix Naumann." Leipzig : Universitätsbibliothek Leipzig, 2015. http://d-nb.info/1239565844/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Tomčová, Lucie. "Datová kvalita v prostředí otevřených a propojitelných dat." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-192414.

Full text
Abstract:
The master thesis deals with data quality in the context of open and linked data. One of the goals is to define the specifics of data quality in this context. The specifics are perceived mainly with respect to data quality dimensions (i.e., the data characteristics studied in data quality) and the possibilities of their measurement. The thesis also defines the effect on data quality connected with the transformation of data to linked data; this effect is described with consideration of the possible risks and benefits that can influence data quality. A list of metrics, verified on real data (open linked data published by a government institution), is composed for the data quality dimensions considered relevant in the context of open and linked data. The thesis points to the need to recognise the differences specific to this context when assessing and managing data quality. At the same time, it offers possibilities for further study of this question and presents subsequent directions for both theoretical and practical evolution of the topic.
APA, Harvard, Vancouver, ISO, and other styles
12

Beretta, Valentina. "évaluation de la véracité des données : améliorer la découverte de la vérité en utilisant des connaissances a priori." Thesis, IMT Mines Alès, 2018. http://www.theses.fr/2018EMAL0002/document.

Full text
Abstract:
The notion of data veracity is increasingly getting attention due to the problem of misinformation and fake news. With more and more information published online, it is becoming essential to develop models that automatically evaluate information veracity. Indeed, the task of evaluating data veracity is very difficult for humans: they are affected by confirmation bias, which prevents them from objectively evaluating the reliability of information, and the amount of information available nowadays makes this task time-consuming. The computational power of computers is required, and it is critical to develop methods that are able to automate this task. In this thesis we focus on Truth Discovery models. These approaches address the data veracity problem when conflicting values about the same properties of real-world entities are provided by multiple sources. They aim to identify which are the true claims among the set of conflicting ones. More precisely, they are unsupervised models based on the rationale that true information is provided by reliable sources and reliable sources provide true information. The main contribution of this thesis consists in improving Truth Discovery models by considering a priori knowledge expressed in ontologies; this knowledge may facilitate the identification of true claims. Two particular aspects of ontologies are considered. First of all, we explore the semantic dependencies that may exist among different values, i.e. the ordering of values through certain conceptual relationships. Indeed, two different values are not necessarily conflicting: they may represent the same concept, but with different levels of detail. In order to integrate this kind of knowledge into existing approaches, we use the mathematical models of partial order. Then, we consider recurrent patterns that can be derived from ontologies. This additional information reinforces the confidence in certain values when certain recurrent patterns are observed; in this case, we model recurrent patterns using rules. Experiments conducted both on synthetic and real-world datasets show that a priori knowledge enhances existing models and paves the way towards a more reliable information world. The source code as well as the synthetic and real-world datasets are freely available.
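A small sketch of the partial-order idea (toy hierarchy, not the thesis' ontologies): two claimed values are treated as non-conflicting when one generalises the other:

```python
# Toy concept hierarchy: child -> parent (a partial order on values).
parent = {
    "Paris": "Ile-de-France",
    "Ile-de-France": "France",
    "Lyon": "Auvergne-Rhone-Alpes",
    "Auvergne-Rhone-Alpes": "France",
}

def ancestors(value):
    """All values that generalise the given one in the hierarchy."""
    seen = set()
    while value in parent:
        value = parent[value]
        seen.add(value)
    return seen

def conflicting(a, b):
    """Two claims conflict only if neither generalises the other."""
    if a == b:
        return False
    return a not in ancestors(b) and b not in ancestors(a)

print(conflicting("Paris", "France"))  # False: France generalises Paris
print(conflicting("Paris", "Lyon"))    # True: genuinely contradictory
```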
APA, Harvard, Vancouver, ISO, and other styles
13

Arndt, Natanael, Kurt Junghanns, Roy Meissner, Philipp Frischmuth, Norman Radtke, Marvin Frommhold, and Michael Martin. "Structured feedback: a distributed protocol for feedback and patches on the Web of Data." Universität Leipzig, 2016. https://ul.qucosa.de/id/qucosa%3A15779.

Full text
Abstract:
The World Wide Web is an infrastructure to publish and retrieve information through web resources. It evolved from a static Web 1.0 into a multimodal and interactive communication and information space, better known as Web 2.0, which is used to collaboratively contribute and discuss web resources. The evolution into a Semantic Web (Web 3.0) is proceeding. One of its remarkable advantages is its decentralized and interlinked data composition. However, in contrast to its data distribution, workflows and technologies for decentralized collaborative contribution are missing. In this paper we propose the Structured Feedback protocol as an interactive addition to the Web of Data. It offers support for users to contribute to the evolution of web resources by providing structured data artifacts as patches for web resources, as well as simple plain-text comments. Based on this approach, it enables crowd-supported quality assessment and web data cleansing processes in an ad-hoc fashion most web users are familiar with.
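For a rough idea of what a structured patch on a web resource could look like (field names and triples are invented; the actual protocol is defined in the paper), consider:

```python
# A minimal structured patch on a web resource: triples to remove and to add.
patch = {
    "target": "http://example.org/resource/Leipzig",
    "remove": [("ex:Leipzig", "dbo:populationTotal", "520838")],
    "add":    [("ex:Leipzig", "dbo:populationTotal", "560472")],
    "comment": "Update population to a newer figure.",
}

def apply_patch(graph, patch):
    """Apply the patch to a set of triples, returning a new set."""
    updated = set(graph) - set(patch["remove"])
    updated |= set(patch["add"])
    return updated

graph = {("ex:Leipzig", "dbo:populationTotal", "520838"),
         ("ex:Leipzig", "rdf:type", "dbo:City")}
print(sorted(apply_patch(graph, patch)))
```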
APA, Harvard, Vancouver, ISO, and other styles
14

Maillot, Pierre. "Nouvelles méthodes pour l'évaluation, l'évolution et l'interrogation des bases du Web des données." Thesis, Angers, 2015. http://www.theses.fr/2015ANGE0007/document.

Full text
Abstract:
The Web of Data offers an environment for sharing and disseminating data within a framework that allows the data to be exploited both by humans and by machines. To this end, the RDF framework formats data as elementary statements of the form (subject, relation, object), called triples. Bases of the Web of Data, called RDF bases, are sets of triples. In an RDF base, the ontology (structural data) organizes the description of the factual data. The number and size of bases on the Web of Data have grown steadily since its creation in 2001. This growth has even accelerated since the emergence of the Linked Data movement in 2008, which encourages the sharing and interlinking of bases publicly accessible on the Internet. These bases cover varied domains such as encyclopedic (e.g. Wikipedia), governmental or bibliographic data. The use and updating of the data in these bases are carried out by communities of users linked by a common domain of interest. This community-driven exploitation is supported by tools that are not yet mature enough to diagnose the content of a base or to query the bases of the Web of Data together. This thesis proposes three methods to support development, both factual and ontological, and to improve the querying of bases of the Web of Data. We first propose a method to evaluate the quality of modifications of factual data when a contributor performs an update. We then propose a method to facilitate the examination of a base by highlighting groups of factual data in conflict with the ontology; the expert guiding the evolution of the base can then modify the ontology or the data. Finally, we propose a querying method for a distributed environment that queries only the bases likely to provide an answer.
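As an illustration of flagging factual data in conflict with the ontology (a sketch with invented domain constraints, not the method of the thesis), one could check subjects' types against each predicate's expected domain:

```python
# Toy ontology: expected rdf:type of the subject (domain) per predicate.
domain = {"dbo:director": "dbo:Film", "dbo:capital": "dbo:Country"}

triples = [
    ("ex:Inception", "rdf:type", "dbo:Film"),
    ("ex:Inception", "dbo:director", "ex:Nolan"),
    ("ex:Nolan", "dbo:capital", "ex:London"),  # conflicts with the ontology
]

# Declared type of each subject.
types = {s: o for s, p, o in triples if p == "rdf:type"}

def ontology_conflicts(triples):
    """Yield factual triples whose subject type violates the predicate's domain."""
    for s, p, o in triples:
        expected = domain.get(p)
        if expected and types.get(s) != expected:
            yield (s, p, o, f"subject is not a {expected}")

for conflict in ontology_conflicts(triples):
    print(conflict)
```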
APA, Harvard, Vancouver, ISO, and other styles
15

Gängler, Thomas. "Semantic Federation of Musical and Music-Related Information for Establishing a Personal Music Knowledge Base." Master's thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-72434.

Full text
Abstract:
Music is perceived and described very subjectively by every individual. Nowadays, people often get lost in their steadily growing, multi-placed, digital music collections. Existing music player and management applications get into trouble when dealing with the poor metadata that is predominant in personal music collections. There are several music information services available that assist users by providing tools for precisely organising their music collection, or for presenting them new insights into their own music library and listening habits. However, it is still not the case that music consumers can seamlessly interact with all these auxiliary services directly from the place where they access their music individually. To profit from the manifold music and music-related knowledge that is or can be made available via various information services, this information has to be gathered up, semantically federated, and integrated into a uniform knowledge base that can represent this data to the users in a personalised and appropriate visualisation. This personalised semantic aggregation of music metadata from several sources is the gist of this thesis. The outlined solution particularly concentrates on users’ needs regarding music collection management, which can strongly alternate between single human beings. The author’s proposal, the personal music knowledge base (PMKB), consists of a client-server architecture with uniform communication endpoints and an ontological knowledge representation model format that is able to represent the versatile information of its use cases. The PMKB concept is appropriate to cover the complete information flow life cycle, including the processes of user account initialisation, information service choice, individual information extraction, and proactive update notification. The PMKB implementation makes use of Semantic Web technologies. Particularly the knowledge representation part of the PMKB vision is explained in this work. Several new Semantic Web ontologies are defined or existing ones are massively modified to meet the requirements of a personalised semantic federation of music and music-related data for managing personal music collections. The outcome is, amongst others:
• a new vocabulary for describing the playback domain,
• another one for representing information service categorisations and quality ratings, and
• one that unites the beneficial parts of the existing advanced user modelling ontologies.
The introduced vocabularies can be perfectly utilised in conjunction with the existing Music Ontology framework. Some RDFizers that also make use of the outlined ontologies in their mapping definitions illustrate the fitness in practice of these specifications. A social evaluation method is applied to carry out an examination dealing with the reutilisation, application and feedback of the vocabularies that are explained in this work. This analysis shows that it is good practice to properly publish Semantic Web ontologies with the help of some Linked Data principles and further basic SEO techniques to easily reach the searching audience, to avoid duplicates of such KR specifications, and, last but not least, to directly establish a "shared understanding". Due to their project-independence, the proposed vocabularies can be deployed in every knowledge representation model that needs their knowledge representation capacities. This thesis added its value to make the vision of a personal music knowledge base come true.
APA, Harvard, Vancouver, ISO, and other styles
16

Michelfeit, Jan. "Integrace Linked Data." Master's thesis, 2013. http://www.nusl.cz/ntk/nusl-321378.

Full text
Abstract:
Linked Data have emerged as a successful publication format which could mean to structured data what the Web meant to documents. The strength of Linked Data is in its fitness for integration of data from multiple sources. Linked Data integration opens the door to new opportunities but also poses new challenges. New algorithms and tools need to be developed to cover all steps of data integration. This thesis examines the established data integration processes and how they can be applied to Linked Data, with a focus on data fusion and conflict resolution. Novel algorithms for Linked Data fusion are proposed, and the task of supporting trust with provenance information and quality assessment of fused data is addressed. The proposed algorithms are implemented as part of a Linked Data integration framework, ODCleanStore.
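A bare-bones sketch of conflict resolution during fusion (invented sources and trust scores, not the ODCleanStore algorithms) might pick the value with the highest trust-weighted support and report a simple quality score for it:

```python
from collections import defaultdict

# Conflicting claims about one property, each with the source that made it.
claims = [
    ("dbpedia", "Berlin"),
    ("eurostat", "Berlin"),
    ("randomblog", "Bonn"),
]
trust = {"dbpedia": 0.9, "eurostat": 0.8, "randomblog": 0.3}

def fuse(claims, trust):
    """Pick the value with the highest summed source trust; also return a
    simple quality score (share of total trust supporting the winner)."""
    support = defaultdict(float)
    for source, value in claims:
        support[value] += trust.get(source, 0.5)
    winner = max(support, key=support.get)
    quality = support[winner] / sum(support.values())
    return winner, round(quality, 2)

print(fuse(claims, trust))  # ('Berlin', 0.85)
```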
APA, Harvard, Vancouver, ISO, and other styles
17

Zaveri, Amrapali. "Linked Data Quality Assessment and its Application to Societal Progress Measurement." Doctoral thesis, 2014. https://ul.qucosa.de/id/qucosa%3A13295.

Full text
Abstract:
In recent years, the Linked Data (LD) paradigm has emerged as a simple mechanism for employing the Web as a medium for data and knowledge integration where both documents and data are linked. Moreover, the semantics and structure of the underlying data are kept intact, making this the Semantic Web. LD essentially entails a set of best practices for publishing and connecting structured data on the Web, which allows publishing and exchanging information in an interoperable and reusable fashion. Many different communities on the Internet such as geographic, media, life sciences and government have already adopted these LD principles. This is confirmed by the dramatically growing Linked Data Web, where currently more than 50 billion facts are represented. With the emergence of the Web of Linked Data, there are several use cases, which are possible due to the rich and disparate data integrated into one global information space. Linked Data, in these cases, not only assists in building mashups by interlinking heterogeneous and dispersed data from multiple sources but also empowers the uncovering of meaningful and impactful relationships. These discoveries have paved the way for scientists to explore the existing data and uncover meaningful outcomes that they might not have been aware of previously. In all these use cases utilizing LD, one crippling problem is the underlying data quality. Incomplete, inconsistent or inaccurate data affects the end results gravely, thus making them unreliable. Data quality is commonly conceived as fitness for use, be it for a certain application or use case. There are cases when datasets that contain quality problems are still useful for certain applications, thus depending on the use case at hand. Thus, LD consumption has to deal with the problem of getting the data into a state in which it can be exploited for real use cases. The insufficient data quality can be caused either by the LD publication process or is intrinsic to the data source itself. A key challenge is to assess the quality of datasets published on the Web and make this quality information explicit. Assessing data quality is particularly a challenge in LD as the underlying data stems from a set of multiple, autonomous and evolving data sources. Moreover, the dynamic nature of LD makes assessing the quality crucial to measure the accuracy of representing the real-world data. On the document Web, data quality can only be indirectly or vaguely defined, but there is a requirement for more concrete and measurable data quality metrics for LD. Such data quality metrics include correctness of facts with respect to the real world, adequacy of semantic representation, quality of interlinks, interoperability, timeliness or consistency with regard to implicit information. Even though data quality is an important concept in LD, there are few methodologies proposed to assess the quality of these datasets. Thus, in this thesis, we first unify 18 data quality dimensions and provide a total of 69 metrics for assessment of LD. The first methodology includes the employment of LD experts for the assessment. This assessment is performed with the help of the TripleCheckMate tool, which was developed specifically to assist LD experts in assessing the quality of a dataset, in this case DBpedia. The second methodology is a semi-automatic process, in which the first phase involves the detection of common quality problems by the automatic creation of an extended schema for DBpedia. The second phase involves the manual verification of the generated schema axioms. Thereafter, we employ the wisdom of the crowd, i.e. workers of online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk), to assess the quality of DBpedia. We then compare the two approaches (previous assessment by LD experts and assessment by MTurk workers in this study) in order to measure the feasibility of each type of user-driven data quality assessment methodology. Additionally, we evaluate another semi-automated methodology for LD quality assessment, which also involves human judgement. In this semi-automated methodology, selected metrics are formally defined and implemented as part of a tool, namely R2RLint. The user is not only provided with the results of the assessment but also with the specific entities that cause the errors, which helps users understand the quality issues and thus fix them. Finally, we take into account a domain-specific use case that consumes LD and leverages data quality. In particular, we identify four LD sources, assess their quality using the R2RLint tool and then utilize them in building the Health Economic Research (HER) Observatory. We show the advantages of this semi-automated assessment over the other types of quality assessment methodologies discussed earlier. The Observatory aims at evaluating the impact of research development on the economic and healthcare performance of each country per year. We illustrate the usefulness of LD in this use case and the importance of quality assessment for any data analysis.
APA, Harvard, Vancouver, ISO, and other styles
18

Kadleček, Rastislav. "Transformace HTML dat o produktech do Linked Data formátu." Master's thesis, 2018. http://www.nusl.cz/ntk/nusl-387272.

Full text
Abstract:
In order to make a step towards the idea of the Semantic Web, it is necessary to research ways to retrieve semantic information from documents published on the current Web 2.0. As an answer to the growing amount of data published in the form of relational tables, the Odalic system, based on the extended TableMiner+ Semantic Table Interpretation algorithm, was introduced to provide a convenient way to semantize tabular data using a knowledge base disambiguation process. The goal of this thesis is to propose an extended algorithm for the Odalic system which would allow the system to gather semantic information for tabular data describing products from e-shops, which have very limited presence in the knowledge bases. This should be achieved by using a machine learning technique called classification. This thesis consists of several parts: obtaining and preprocessing the product data from e-shops, evaluation of several classification algorithms in order to select the best-performing one, description of the design and implementation of the extended Odalic algorithm, description of its integration into the Odalic system, evaluation of the improved algorithm using the obtained product data, and semantization of the product data using the new Odalic algorithm. In the end, the results are concluded and possible...
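As a hedged illustration of the classification step (a toy bag-of-words model over invented product names, far simpler than the algorithms evaluated in the thesis):

```python
from collections import Counter, defaultdict

# Tiny labelled sample of product names; the data is assumed for the example.
train = [
    ("apple iphone 6 smartphone", "electronics"),
    ("samsung galaxy phone", "electronics"),
    ("nike running shoes", "fashion"),
    ("adidas sneakers men", "fashion"),
]

def fit(train):
    """Count word occurrences per category (a bare-bones bag-of-words model)."""
    model = defaultdict(Counter)
    for text, label in train:
        model[label].update(text.lower().split())
    return model

def predict(model, text):
    """Pick the category whose training words overlap the text the most."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))

model = fit(train)
print(predict(model, "iphone case"))       # electronics
print(predict(model, "running sneakers"))  # fashion
```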
APA, Harvard, Vancouver, ISO, and other styles
19

"MOOCLink: Linking and Maintaining Quality of Data Provided by Various MOOC Providers." Master's thesis, 2016. http://hdl.handle.net/2286/R.I.39445.

Full text
Abstract:
The concept of Linked Data is gaining widespread popularity and importance. The method of publishing and linking structured data on the web is called Linked Data. The emergence of Linked Data has made it possible to make sense of huge data, which is scattered all over the web, and to link multiple heterogeneous sources. This leads to the challenge of maintaining the quality of Linked Data, i.e., ensuring outdated data is removed and new data is included. The focus of this thesis is devising strategies to effectively integrate data from multiple sources, publish it as Linked Data, and maintain the quality of Linked Data. The domain used in the study is online education. With so many online courses offered as Massive Open Online Courses (MOOCs), it is becoming increasingly difficult for an end user to gauge which course best fits his/her needs; users are spoilt for choice. It would be very helpful for them to make a choice if there were a single place where they can visually compare the offerings of various MOOC providers for the course they are interested in. Previous work has been done in this area through the MOOCLink project, which involved integrating data from Coursera, EdX, and Udacity and generating linked data, i.e., Resource Description Framework (RDF) triples. The research objective of this thesis is to determine a methodology by which the quality of data available through the MOOCLink application is maintained, as new courses are constantly being added and old courses removed by data providers. This thesis presents the integration of data from various MOOC providers and algorithms for incrementally updating linked data to maintain their quality, and compares them against a naïve approach in order to constantly keep users engaged with up-to-date data. A master threshold value that quantifies when one algorithm outperforms the other in terms of time efficiency was determined through experiments and analysis. An evaluation of the tool shows the effectiveness of the algorithms presented in this thesis.
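To illustrate the incremental-update idea (toy course records and a naïve dictionary diff, not the MOOCLink algorithms), only the delta between the published data and the provider's latest crawl is applied:

```python
# Previously published course records vs. the provider's latest crawl,
# keyed by course id; the data is invented for the example.
published = {"cs101": "Intro to CS", "ml201": "Machine Learning"}
latest    = {"cs101": "Intro to CS", "db301": "Databases"}

def incremental_update(published, latest):
    """Apply only the difference instead of regenerating every record."""
    added   = {k: v for k, v in latest.items() if k not in published}
    removed = [k for k in published if k not in latest]
    changed = {k: v for k, v in latest.items()
               if k in published and published[k] != v}
    for k in removed:
        del published[k]
    published.update(added)
    published.update(changed)
    return added, removed, changed

print(incremental_update(published, latest))
print(published)
```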
Dissertation/Thesis
Masters Thesis Computer Science 2016
APA, Harvard, Vancouver, ISO, and other styles
20

Kontokostas, Dimitrios. "Large-Scale Multilingual Knowledge Extraction, Publishing and Quality Assessment: The case of DBpedia." 2017. https://ul.qucosa.de/id/qucosa%3A31447.

Full text
APA, Harvard, Vancouver, ISO, and other styles
