Dissertations / Theses on the topic 'Data quality'

To see the other types of publications on this topic, follow the link: Data quality.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Data quality.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Grillo, Aderibigbe. "Developing a data quality scorecard that measures data quality in a data warehouse." Thesis, Brunel University, 2018. http://bura.brunel.ac.uk/handle/2438/17137.

Full text
Abstract:
The main purpose of this thesis is to develop a data quality scorecard (DQS) that aligns the data quality needs of the data warehouse (DW) stakeholder group with selected data quality dimensions. To comprehend the research domain, a general and systematic literature review (SLR) was carried out, after which the research scope was established. Using Design Science Research (DSR) as the methodology to structure the research, three iterations were carried out to achieve the research aim highlighted in this thesis. In the first iteration, with DSR used as the paradigm, the artefact was built from the results of the general and systematic literature review, and a data quality scorecard (DQS) was conceptualised. The results of the SLR and the recommendations for designing an effective scorecard provided the input for the development of the DQS. Using a System Usability Scale (SUS) to validate the usability of the DQS, the results of the first iteration suggest that the DW stakeholders found the DQS useful. The second iteration evaluated the DQS further through a run-through in the FMCG domain followed by semi-structured interviews. The thematic analysis of the semi-structured interviews demonstrated that the stakeholder participants found the DQS to be transparent, a useful additional reporting tool, well integrated, easy to use, and consistent, and found that it increases confidence in the data. However, the timeliness data dimension was found to be redundant, necessitating a modification to the DQS. The third iteration followed similar steps to the second iteration but applied the modified DQS in the oil and gas domain. The results from the third iteration suggest that the DQS is a useful tool that is easy to use on a daily basis. The research contributes to theory by demonstrating a novel approach to DQS design; this was achieved by ensuring the design of the DQS aligns with the data quality concern areas of the DW stakeholders and the data quality dimensions. Further, this research lays a good foundation for the future by establishing a DQS model that can be used as a base for further development.
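To make the scorecard idea concrete, here is a minimal sketch, not the DQS developed in the thesis: a toy table is scored on three assumed dimensions, and the per-dimension scores are rolled up with hypothetical stakeholder weights.

```python
# A minimal scorecard sketch: dimension scores on a toy table, rolled up
# with hypothetical stakeholder weights. All names and weights are assumed.
from datetime import date

records = [
    {"customer_id": 1, "email": "a@x.com", "updated": date(2024, 5, 1)},
    {"customer_id": 2, "email": None,      "updated": date(2021, 1, 10)},
    {"customer_id": 2, "email": "b@x.com", "updated": date(2024, 4, 2)},
]

def completeness(rows, field):
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, key):
    return len({r[key] for r in rows}) / len(rows)

def currency(rows, field, max_age_days=365, today=date(2024, 6, 1)):
    return sum((today - r[field]).days <= max_age_days for r in rows) / len(rows)

# Scorecard: dimension -> (score, stakeholder weight)
scorecard = {
    "completeness": (completeness(records, "email"), 0.4),
    "uniqueness":   (uniqueness(records, "customer_id"), 0.4),
    "currency":     (currency(records, "updated"), 0.2),
}
for dim, (score, weight) in scorecard.items():
    print(f"{dim:12s} {score:.2f} (weight {weight})")
print(f"overall      {sum(s * w for s, w in scorecard.values()):.2f}")
```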
APA, Harvard, Vancouver, ISO, and other styles
2

Sýkorová, Veronika. "Data Quality Metrics." Master's thesis, Vysoká škola ekonomická v Praze, 2008. http://www.nusl.cz/ntk/nusl-2815.

Full text
Abstract:
The aim of the thesis is to prove the measurability of Data Quality, which is a relatively subjective measure and thus difficult to quantify. In doing this, various aspects of measuring the quality of data are analyzed and a Complex Data Quality Monitoring System is introduced, with the aim of providing a concept for measuring and monitoring the overall Data Quality in an organization. The system is built on a metrics hierarchy decomposed into particular detailed metrics, dimensions enabling multidimensional analyses of the metrics, and processes being measured by the metrics. The first part of the thesis (Chapter 2 and Chapter 3) focuses on Data Quality itself, i.e. it provides various definitions of Data Quality, gives reasoning for the importance of Data Quality in a company, and presents some of the most common tools and solutions aimed at managing Data Quality in an organization. The second part of the thesis (Chapter 4 and Chapter 5) builds on the previous part and leads into measuring Data Quality using metrics, i.e. it contains the definition and purpose of Data Quality Metrics, places them into the multidimensional context (dimensions, hierarchies) and states five possible decompositions of Data Quality metrics into detail. The third part of the thesis (Chapter 6) contains the proposed Complex Data Quality Monitoring System, including a description of the Data Quality Management related dimensions and processes, and, most importantly, a detailed definition of the bottom-level metrics used for calculation of the overall Data Quality.
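A metrics hierarchy of this kind can be illustrated with a short sketch; the metric names, the hierarchy and the weights below are assumptions for illustration, not the system proposed in the thesis.

```python
# A minimal sketch of rolling bottom-level data quality metrics up a
# hierarchy into an overall score. Metric names and weights are assumed.

hierarchy = {
    "overall": {"completeness": 0.5, "validity": 0.5},
    "completeness": {"null_rate_email": 0.6, "null_rate_phone": 0.4},
    "validity": {"bad_email_format": 1.0},
}

# Bottom-level metric values, already normalised to [0, 1] (1 = perfect).
bottom = {"null_rate_email": 0.92, "null_rate_phone": 0.75, "bad_email_format": 0.98}

def score(node):
    if node in bottom:                       # leaf metric
        return bottom[node]
    children = hierarchy[node]               # weighted roll-up
    return sum(w * score(child) for child, w in children.items())

print(round(score("overall"), 3))            # 0.916
```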
APA, Harvard, Vancouver, ISO, and other styles
3

Yu, Wenyuan. "Improving data quality : data consistency, deduplication, currency and accuracy." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/8899.

Full text
Abstract:
Data quality is one of the key problems in data management. An unprecedented amount of data has been accumulated and has become a valuable asset of an organization. The value of the data relies greatly on its quality. However, data is often dirty in real life. It may be inconsistent, duplicated, stale, inaccurate or incomplete, which can reduce its usability and increase the cost of businesses. Consequently, the need to improve data quality arises, which comprises five central issues: data consistency, data deduplication, data currency, data accuracy and information completeness. This thesis presents the results of our work on the first four issues, namely data consistency, deduplication, currency and accuracy. The first part of the thesis investigates incremental verification of data consistency in distributed data. Given a distributed database D, a set S of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, changes ΔV to V in response to ΔD. Although the problems are intractable, we show that they are bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. Such incremental algorithms are provided for both vertically and horizontally partitioned data, and we show that the algorithms are optimal. The second part of the thesis studies the interaction between record matching and data repairing. Record matching, the main technique underlying data deduplication, aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data using constraints. These are treated as separate processes in most data cleaning systems, based on heuristic solutions. However, our studies show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, a uniform framework that seamlessly unifies repairing and matching operations is proposed to clean a database based on integrity constraints, matching rules and master data. The third part of the thesis presents our study of finding certain fixes that are absolutely correct for data repairing. Data repairing methods based on integrity constraints are normally heuristic, and they may not find certain fixes. Worse still, they may even introduce new errors when attempting to repair the data, which may not work well when repairing critical data such as medical records, in which a seemingly minor error often has disastrous consequences. We propose a framework and an algorithm to find certain fixes, based on master data, a class of editing rules and user interactions. A prototype system is also developed. The fourth part of the thesis introduces inferring data currency and consistency for conflict resolution, where data currency aims to identify the current values of entities, and conflict resolution is to combine tuples that pertain to the same real-world entity into a single tuple and resolve conflicts, which is also an important issue for data deduplication. We show that data currency and consistency help each other in resolving conflicts. We study a number of associated fundamental problems, and develop an approach for conflict resolution by inferring data currency and consistency.
The last part of the thesis reports our study of data accuracy on the longstanding relative accuracy problem which is to determine, given tuples t1 and t2 that refer to the same entity e, whether t1[A] is more accurate than t2[A], i.e., t1[A] is closer to the true value of the A attribute of e than t2[A]. We introduce a class of accuracy rules and an inference system with a chase procedure to deduce relative accuracy, and the related fundamental problems are studied. We also propose a framework and algorithms for inferring accurate values with users’ interaction.
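The conditional functional dependencies (CFDs) used throughout the thesis can be illustrated with a small check; the relation, the rule and the column names below are invented for illustration.

```python
# A minimal sketch of checking a constant CFD of the form
# [country = 'UK', zip] -> city: among UK rows, zip must determine city.
# Data and rule are assumed examples.

rows = [
    {"id": 1, "country": "UK", "zip": "EH8", "city": "Edinburgh"},
    {"id": 2, "country": "UK", "zip": "EH8", "city": "Edinburg"},   # violation
    {"id": 3, "country": "NL", "zip": "EH8", "city": "Enschede"},   # pattern not matched
]

def cfd_violations(rows, condition, lhs, rhs):
    """Return (representative id, offending id) pairs: rows matching the
    constant condition that agree on lhs but differ on rhs."""
    matched = [r for r in rows if all(r[a] == v for a, v in condition.items())]
    seen = {}
    violations = []
    for r in matched:
        key = tuple(r[a] for a in lhs)
        if key in seen and seen[key][rhs] != r[rhs]:
            violations.append((seen[key]["id"], r["id"]))
        seen.setdefault(key, r)
    return violations

print(cfd_violations(rows, condition={"country": "UK"}, lhs=["zip"], rhs="city"))
# [(1, 2)]
```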
APA, Harvard, Vancouver, ISO, and other styles
4

Peralta, Veronika. "Data Quality Evaluation in Data Integration Systems." Phd thesis, Université de Versailles-Saint Quentin en Yvelines, 2006. http://tel.archives-ouvertes.fr/tel-00325139.

Full text
Abstract:
The need to access multiple data sources in a uniform way grows stronger every day, particularly in decision-support systems, which require a comprehensive analysis of the data. With the development of Data Integration Systems (DIS), information quality has become a first-class property increasingly demanded by users. This thesis deals with data quality in DIS. More precisely, we are interested in the problems of evaluating the quality of the data delivered to users in response to their queries and of satisfying users' quality requirements. We also analyse the use of quality measures for improving the design of the DIS and the quality of the data. Our approach consists in studying one quality factor at a time, analysing its relationship with the DIS, proposing techniques for its evaluation and proposing actions for its improvement. Among the quality factors that have been proposed, this thesis analyses two: data freshness and data accuracy. We analyse the various definitions and measures that have been proposed for data freshness and data accuracy, and we bring out the properties of the DIS that have a significant impact on their evaluation. We summarise the analysis of each factor by means of a taxonomy, which serves to compare existing work and to highlight open problems. We propose a framework that models the different elements involved in quality evaluation, such as data sources, user queries, the DIS integration processes, DIS properties, quality measures and quality evaluation algorithms. In particular, we model the DIS integration processes as workflow processes in which the activities perform the tasks that extract, integrate and deliver data to users. Our reasoning support for quality evaluation is a directed acyclic graph, called a quality graph, which has the same structure as the DIS and contains, as labels, the DIS properties that are relevant to quality evaluation. We develop evaluation algorithms that take as input the quality values of the source data and the DIS properties, and combine these values to qualify the data delivered by the DIS. They rely on the graph representation and combine the property values while traversing the graph. The evaluation algorithms can be specialised to take into account the properties that influence quality in a concrete application. The idea behind the framework is to define a flexible context that allows the specialisation of the evaluation algorithms to specific application scenarios. The quality values obtained during evaluation are compared with those expected by users. Improvement actions can be carried out if the quality requirements are not satisfied. We suggest elementary improvement actions that can be composed to improve quality in a concrete DIS. Our approach to improving data freshness consists in analysing the DIS at different abstraction levels in order to identify its critical points and to target the application of improvement actions at those points.
Our approach to improving data accuracy consists in partitioning the query results into portions (certain attributes, certain tuples) with homogeneous accuracy. This allows user applications to visualise only the most accurate data, to filter out the data that do not satisfy the accuracy requirements, or to visualise the data in slices according to their accuracy. Compared with existing source-selection approaches, our proposal makes it possible to select the most accurate portions instead of filtering out entire sources. The main contributions of this thesis are: (1) a detailed analysis of the freshness and accuracy quality factors; (2) the proposal of techniques and algorithms for the evaluation and improvement of data freshness and accuracy; and (3) a quality evaluation prototype usable in DIS design.
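The quality-graph idea (combining source quality values while traversing a DAG that mirrors the integration process) can be sketched briefly; the graph, the freshness values and the combination rule below are illustrative assumptions.

```python
# A minimal sketch of propagating data freshness through a quality graph:
# a DAG whose nodes are data sources and integration activities. Here the
# freshness of an activity is the worst (largest) input age plus its own
# delay -- one simple combination rule among many. Values are assumed.

graph = {                        # node -> predecessor nodes
    "source_A": [],
    "source_B": [],
    "join":     ["source_A", "source_B"],
    "report":   ["join"],
}
age_minutes = {"source_A": 10, "source_B": 45}    # source freshness
delay_minutes = {"join": 5, "report": 1}          # per-activity delay

def freshness(node):
    preds = graph[node]
    if not preds:                                 # a data source
        return age_minutes[node]
    return max(freshness(p) for p in preds) + delay_minutes[node]

print(freshness("report"))   # max(10, 45) + 5 + 1 = 51 minutes
```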
APA, Harvard, Vancouver, ISO, and other styles
5

Peralta, Costabel Veronika del Carmen. "Data quality evaluation in data integration systems." Versailles-St Quentin en Yvelines, 2006. http://www.theses.fr/2006VERS0020.

Full text
Abstract:
This thesis deals with data quality evaluation in Data Integration Systems (DIS). Specifically, we address the problems of evaluating the quality of the data conveyed to users in response to their queries and verifying if users’ quality expectations can be achieved. We also analyze how quality measures can be used for improving the DIS and enforcing data quality. Our approach consists in studying one quality factor at a time, analyzing its impact within a DIS, proposing techniques for its evaluation and proposing improvement actions for its enforcement. Among the quality factors that have been proposed, this thesis analyzes two of the most used ones: data freshness and data accuracy
APA, Harvard, Vancouver, ISO, and other styles
6

Deb, Rupam. "Data Quality Enhancement for Traffic Accident Data." Thesis, Griffith University, 2017. http://hdl.handle.net/10072/367725.

Full text
Abstract:
Death, injury, and disability resulting from road traffic crashes continue to be a major global public health problem. Recent data suggest that the number of fatalities from traffic crashes is in excess of 1.25 million people each year with non-fatal injuries affecting a further 20-50 million people. It is predicted that by 2030, road traffic accidents will have progressed to be the 5th leading cause of death and that the number of people who will die annually from traffic accidents will have doubled from current levels. Both developed and developing countries suffer from the consequences of the increase in human population, and consequently, vehicle numbers. Therefore, methods to reduce accident severity are of great interest to traffic agencies and the public at large. To analyze traffic accident factors effectively, a complete traffic accident historical database is needed. Road accident fatality rates depend on many factors, so it is a very challenging task to investigate the dependencies between the attributes because of the many environmental and road accident factors. Missing data and noisy data in the database obscure the discovery of important factors and lead to invalid conclusions.
Thesis (PhD Doctorate)
Doctor of Philosophy (PhD)
School of Information and Communication Technology
Science, Environment, Engineering and Technology
Full Text
APA, Harvard, Vancouver, ISO, and other styles
7

He, Ying Surveying &amp Spatial Information Systems Faculty of Engineering UNSW. "Spatial data quality management." Publisher:University of New South Wales. Surveying & Spatial Information Systems, 2008. http://handle.unsw.edu.au/1959.4/43323.

Full text
Abstract:
The applications of geographic information systems (GIS) in various areas have highlighted the importance of data quality. Data quality research has been given a priority by GIS academics for three decades. However, the outcomes of data quality research have not been sufficiently translated into practical applications. Users still need a GIS capable of storing, managing and manipulating data quality information. To fill this gap, this research aims to investigate how we can develop a tool that effectively and efficiently manages data quality information to aid data users in better understanding and assessing the quality of their GIS outputs. Specifically, this thesis aims: 1. To develop a framework for establishing a systematic linkage between data quality indicators and appropriate uncertainty models; 2. To propose an object-oriented data quality model for organising and documenting data quality information; 3. To create data quality schemas for defining and storing the contents of metadata databases; 4. To develop a new conceptual model of data quality management; 5. To develop and implement a prototype system for enhancing the capability of data quality management in commercial GIS. Based on reviews of error and uncertainty modelling in the literature, a conceptual framework has been developed to establish the systematic linkage between data quality elements and appropriate error and uncertainty models. To overcome the limitations identified in the review and satisfy a series of requirements for representing data quality, a new object-oriented data quality model has been proposed. It enables data quality information to be documented and stored in a multi-level structure and to be integrally linked with spatial data to allow access, processing and graphic visualisation. A conceptual model for data quality management is proposed in which a data quality storage model, uncertainty models and visualisation methods are the three basic components. This model establishes the processes involved when managing data quality, emphasising the integration of uncertainty modelling and visualisation techniques. The above studies lay the theoretical foundations for the development of a prototype system with the ability to manage data quality. An object-oriented approach, database technology and programming technology have been integrated to design and implement the prototype system within the ESRI ArcGIS software. The object-oriented approach allows the prototype to be developed in a more flexible and easily maintained manner. The prototype allows users to browse and access data quality information at different levels. Moreover, a set of error and uncertainty models is embedded within the system. With the prototype, data quality elements can be extracted from the database and automatically linked with the appropriate error and uncertainty models, as well as with their implications in the form of simple maps. This function offers users a set of different uncertainty models to choose from when assessing how uncertainty inherent in the data can affect their specific application. It will significantly increase the users' confidence in using data for a particular situation. To demonstrate the enhanced capability of the prototype, the system has been tested against real data. The implementation has shown that the prototype can efficiently assist data users, especially non-expert users, to better understand data quality and utilise it in a more practical way.
The methodologies and approaches for managing quality information presented in this thesis should serve as an impetus for supporting further research.
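A very small sketch of what an object-oriented data quality model linked to an uncertainty model might look like; the class names and the Gaussian positional-error example are assumptions for illustration, not the prototype implemented within ESRI ArcGIS in the thesis.

```python
# A minimal sketch: a quality element attached to a spatial feature and
# linked to an uncertainty model that applications can query. Assumed design.
from dataclasses import dataclass
import random

@dataclass
class GaussianUncertainty:           # positional accuracy as a normal error
    std_dev_m: float
    def sample_error(self) -> float:
        return random.gauss(0.0, self.std_dev_m)

@dataclass
class DataQualityElement:
    name: str                        # e.g. "positional accuracy"
    value: float                     # reported value, in metres
    lineage: str                     # provenance note
    uncertainty: GaussianUncertainty

@dataclass
class SpatialFeature:
    feature_id: int
    quality: list                    # DataQualityElement instances

road = SpatialFeature(1, [DataQualityElement(
    "positional accuracy", 2.5, "digitised from 1:10k map",
    GaussianUncertainty(std_dev_m=2.5))])
print(road.quality[0].uncertainty.sample_error())   # one simulated error draw
```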
APA, Harvard, Vancouver, ISO, and other styles
8

Redgert, Rebecca. "Evaluating Data Quality in a Data Warehouse Environment." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-208766.

Full text
Abstract:
The amount of data accumulated by organizations has grown significantly during the last couple of years, increasing the importance of data quality. Ensuring data quality for large amounts of data is a complicated task, but crucial to subsequent analysis. This study investigates how to maintain and improve data quality in a data warehouse. A case study of the errors in a data warehouse was conducted at the Swedish company Kaplan, and resulted in guiding principles on how to improve the data quality. The investigation was done by manually comparing data from the source systems to the data integrated in the data warehouse and applying a quality framework based on semiotic theory to identify errors. The three main guiding principles given are (1) to implement a standardized format for the source data, (2) to implement a check prior to integration where the source data are reviewed and corrected if necessary, and (3) to create and implement specific database integrity rules. Further work is encouraged on establishing a guide for the framework on how to best perform a manual approach for comparing data, and on quality assurance of source data.
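Guiding principle (2), checking and correcting source data before integration, can be illustrated with a short sketch; the fields, formats and rules below are hypothetical and are not taken from the source systems studied in the thesis.

```python
# A minimal sketch of a pre-integration check: validate source rows against
# a standardized format and simple integrity rules before loading them into
# the warehouse. Field names and rules are illustrative assumptions.
import re
from datetime import datetime

RULES = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "email":       lambda v: v is None or re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v),
    "created":     lambda v: bool(datetime.strptime(v, "%Y-%m-%d")),  # ISO date
}

def validate(row):
    """Return a list of (field, value) pairs that break a rule."""
    errors = []
    for field, check in RULES.items():
        try:
            ok = check(row.get(field))
        except (TypeError, ValueError):
            ok = False
        if not ok:
            errors.append((field, row.get(field)))
    return errors

rows = [
    {"customer_id": 7, "email": "a@x.com", "created": "2024-05-01"},
    {"customer_id": -1, "email": "not-an-email", "created": "01/05/2024"},
]
for row in rows:
    problems = validate(row)
    print("load" if not problems else f"reject: {problems}")
```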
APA, Harvard, Vancouver, ISO, and other styles
9

Bringle, Per. "Data Quality in Data Warehouses: a Case Study." Thesis, University of Skövde, Department of Computer Science, 1999. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-404.

Full text
Abstract:

Companies today experience problems with poor data quality in their systems. Because of the enormous amount of data in companies, the data has to be of good quality if companies want to take advantage of it. Since the purpose of a data warehouse is to gather information from several databases for decision support, it is absolutely vital that its data is of good quality. There exist several ways of determining or classifying data quality in databases. In this work, the data quality management in a large Swedish company's data warehouse is examined through a case study, using a framework specialized for data warehouses. The quality of data is examined from the syntactic, semantic and pragmatic points of view. The results of the examination are then compared with a similar case study previously conducted in order to find any differences and similarities.

APA, Harvard, Vancouver, ISO, and other styles
10

Li, Lin. "Data quality and data cleaning in database applications." Thesis, Edinburgh Napier University, 2012. http://researchrepository.napier.ac.uk/Output/5788.

Full text
Abstract:
Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many reasons for such failures, such as poor system infrastructure design or query performance, but nothing is more certain to yield failure than a lack of concern for the issue of data quality. High-quality data is a key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial factor. It has now been recognized that an inordinate proportion of data in most data sources is dirty. Obviously, a database application with a high proportion of dirty data is not reliable for the purpose of data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also unreliable. In order to ensure high quality of data, enterprises need to have a process, methodologies and resources to monitor and analyze the quality of data, and methodologies for preventing and/or detecting and repairing dirty data. This thesis focuses on the improvement of data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to the improvement of the quality of data, and has addressed a number of research issues related to data cleaning. In the first part of the thesis, related literature on data cleaning and data quality is reviewed and discussed. Building on this research, a rule-based taxonomy of dirty data is proposed in the second part of the thesis. The proposed taxonomy not only summarizes the most common dirty data types but is also the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This helps us to design the DDS process in the proposed data cleaning framework described in the third part of the thesis. This framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process. Finally, a set of approximate string matching algorithms is studied and experimental work has been undertaken. Approximate string matching is an important part of many data cleaning approaches and has been well studied for many years. The experimental work in the thesis confirmed the statement that there is no clear best technique. It shows that the characteristics of the data, such as the size of a dataset, the error rate in a dataset, the type of strings in a dataset and even the type of typo in a string, have a significant effect on the performance of the selected techniques. In addition, the characteristics of the data also affect the selection of suitable threshold values for the selected matching algorithms. The achievements based on these experimental results provide a fundamental improvement in the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of the data cleaning system in database applications.
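Since the thesis compares approximate string matching techniques and threshold choices, a small sketch may help; the edit-distance implementation and the threshold below are generic illustrations, not the specific algorithms evaluated in the thesis. As the experiments suggest, the threshold that works best depends on the dataset's error rate and string types.

```python
# A minimal sketch of approximate string matching with normalised edit
# distance and a similarity threshold.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    dist = edit_distance(a.lower(), b.lower())
    return 1 - dist / max(len(a), len(b), 1) >= threshold

print(similar("Jonathan Smith", "Jonathon Smith"))   # True  (one substitution)
print(similar("Jonathan Smith", "J. Smythe"))        # False
```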
APA, Harvard, Vancouver, ISO, and other styles
11

Wad, Charudatta V. "QoS : quality driven data abstraction for large databases." Worcester, Mass. : Worcester Polytechnic Institute, 2008. http://www.wpi.edu/Pubs/ETD/Available/etd-020508-151213/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Kara, Madjid. "Data quality for the decision of the ambient systems." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLV009.

Full text
Abstract:
Data quality is a concern common to all information technology projects; it has become a complex research domain with the multiplicity and expansion of different data sources. Researchers have studied the modeling and evaluation of data quality; several approaches have been proposed, but they are limited to a specific field of use and do not offer a quality profile that enables the evaluation of a global data quality model. Evaluation based on ISO quality models has emerged; however, these models provide no guidance for their use and must be adapted to each scenario without precise methods. Our work focuses on the data quality issues of an ambient system, where the time constraints for decision-making are tighter than in traditional applications. The main objective is to provide the decision-making system with a very specific view of the quality of the sensor data. We identify the quantifiable aspects of sensor data in order to link them to the appropriate metrics of our specified data quality model. Our work presents the following contributions: (i) creating a generic data quality model based on several existing data quality standards; (ii) formalizing the data quality models as an ontology, which allows them (from i) to be integrated by specifying the links, named equivalence relations, that exist between the criteria composing these models; (iii) proposing an instantiation algorithm to extract the specified data quality model from the generic data quality model; (iv) proposing a global evaluation approach for the specified data quality model using two processes: the first executes the metrics on the sensor data, and the second takes the result of this execution and uses fuzzy logic to evaluate the quality factors of our specified data quality model. The expert then defines weight values for each factor, based on the interdependence table, to take into account the interaction between the different criteria, and an aggregation procedure is used to obtain a degree of confidence. Based on this final result, the decision component performs an analysis and makes a decision.
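The two-step evaluation described above (executing metrics on sensor data, then aggregating factor scores with expert weights into a degree of confidence) can be sketched as follows; the membership functions, factors and weights are illustrative assumptions, not the model developed in the thesis.

```python
# A minimal sketch of aggregating sensor data quality factors into a single
# degree of confidence. A triangular fuzzy membership turns a raw metric
# value into a [0, 1] score; the weights stand in for the expert's
# interdependence-based weighting.

def triangular(x, low, peak, high):
    """Simple triangular membership function returning a value in [0, 1]."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)

# Raw metric values computed from the sensor stream (assumed).
metrics = {"timeliness_s": 2.5, "completeness": 0.96, "precision": 0.88}

# Factor scores via (assumed) membership functions.
scores = {
    "timeliness":   triangular(metrics["timeliness_s"], 0, 0.5, 10),   # fresher is better
    "completeness": metrics["completeness"],
    "precision":    metrics["precision"],
}

weights = {"timeliness": 0.5, "completeness": 0.3, "precision": 0.2}   # expert weights
confidence = sum(scores[f] * weights[f] for f in scores)
print(round(confidence, 3))   # degree of confidence used by the decision component
```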
APA, Harvard, Vancouver, ISO, and other styles
13

Barker, James M. "Data governance| The missing approach to improving data quality." Thesis, University of Phoenix, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10248424.

Full text
Abstract:

In an environment where individuals use applications to drive activities, from what book to purchase and what film to view to what temperature to heat a home, data is the critical element. To make things work, data must be correct, complete, and accurate. Many firms view data governance as a panacea for the ills of systems and organizational challenges, while other firms struggle to generate value from these programs. This paper documents a study that was executed to understand what is being done by firms in the data governance space and why. The conceptual framework established from the literature on the subject was a set of six areas that should be addressed by a data governance program: data governance councils; data quality; master data management; data security; policies and procedures; and data architecture. There is a wide range of experiences and ways to address data quality, and the focus needs to be on execution. This explanatory case study examined the experiences of 100 professionals at 41 firms to understand what is being done and why professionals are undertaking such an endeavor. The outcome is that firms need to address data quality, data security, and operational standards in a manner that is organized around business value, including strong business leader sponsorship and a documented, dynamic business case. The outcome of this study provides a foundation for data governance program success and a guide to getting started.

APA, Harvard, Vancouver, ISO, and other styles
14

Wolf, Hilke. "Data Quality Bench-Marking for High Resolution Bragg Data." Doctoral thesis, Niedersächsische Staats- und Universitätsbibliothek Göttingen, 2014. http://hdl.handle.net/11858/00-1735-0000-0022-5DE2-A.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Swapna, B., and R. VijayaPrakash. "Privacy Preserving Data Mining Operations without Disrupting Data Quality." International Journal of Computer Science and Network (IJCSN), 2012. http://hdl.handle.net/10150/271473.

Full text
Abstract:
Data mining operations have become prevalent as they can extract trends or patterns that help in taking good business decisions. They often operate on large historical databases or data warehouses to obtain actionable knowledge or business intelligence that helps in taking well-informed decisions. Many tools have emerged in the data mining domain to perform data mining operations, and these tools are best used to obtain actionable knowledge from data; doing this manually is not possible, as the data is very large and the analysis takes a lot of time. Thus, the data mining domain is improving at a rapid pace. While data mining operations are very useful in obtaining business intelligence, they also have a drawback: they can extract sensitive information from the database, and people may misuse this freedom by obtaining sensitive information illegally. Preserving the privacy of data is therefore also important. Towards this end, many Privacy Preserving Data Mining (PPDM) algorithms came into existence that sanitize data to prevent data mining algorithms from extracting sensitive information from the databases.
Data mining operations help discover business intelligence from historical data. The extracted business intelligence or actionable knowledge helps in taking well-informed decisions that lead to profit for the organization that makes use of it. While performing mining, the privacy of data has to be given the utmost importance. To achieve this, PPDM (Privacy Preserving Data Mining) came into existence, sanitizing the database to prevent the discovery of association rules. However, this leads to modification of the data and thus disrupts data quality. This paper proposes a new technique and algorithms that can perform privacy preserving data mining operations while ensuring that data quality is not lost. The empirical results revealed that the proposed technique is useful and can be used in real-world applications.
APA, Harvard, Vancouver, ISO, and other styles
16

Ma, Shuai. "Extending dependencies for improving data quality." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5045.

Full text
Abstract:
This doctoral thesis presents the results of my work on extending dependencies for improving data quality, both in a centralized environment with a single database and in a data exchange and integration environment with multiple databases. The first part of the thesis proposes five classes of data dependencies, referred to as CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly found in practice in a centralized environment. For each class of these dependencies, we investigate two central problems: the satisfiability problem and the implication problem. The satisfiability problem is to determine, given a set Σ of dependencies defined on a database schema R, whether or not there exists a nonempty database D of R that satisfies Σ. The implication problem is to determine whether or not a set Σ of dependencies defined on a database schema R entails another dependency φ on R; that is, whether each database D of R that satisfies Σ must satisfy φ as well. These problems are important for the validation and optimization of data-cleaning processes. We establish complexity results of the satisfiability problem and the implication problem for all these five classes of dependencies, both in the absence of finite-domain attributes and in the general setting with finite-domain attributes. Moreover, SQL-based techniques are developed to detect data inconsistencies for each class of the proposed dependencies, which can be easily implemented on top of current database management systems. The second part of the thesis studies three important topics for data cleaning in a data exchange and integration environment with multiple databases. One is the dependency propagation problem, which is to determine, given a view defined on data sources and a set of dependencies on the sources, whether another dependency is guaranteed to hold on the view. We investigate dependency propagation for views defined in various fragments of relational algebra, conditional functional dependencies (CFDs) [FGJK08] as view dependencies, and for source dependencies given as either CFDs or traditional functional dependencies (FDs). We establish lower and upper bounds, all matching, ranging from PTIME to undecidable. These not only provide the first results for CFD propagation, but also extend the classical work of FD propagation by giving new complexity bounds in the presence of a setting with finite domains. We finally provide the first algorithm for computing a minimal cover of all CFDs propagated via SPC views. The algorithm has the same complexity as one of the most efficient algorithms for computing a cover of FDs propagated via a projection view, despite the increased expressive power of CFDs and SPC views. Another one is matching records from unreliable data sources. A class of matching dependencies (MDs) is introduced for specifying the semantics of unreliable data. As opposed to static constraints for schema design such as FDs, MDs are developed for record matching, and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations.
We also propose a mechanism for inferring MDs with a sound and complete system, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. We finally provide a quadratic time algorithm for inferring MDs, and an effective algorithm for deducing quality RCKs from a given set of MDs. The last one is finding certain fixes for data monitoring [CGGM03, SMO07], which is to find and correct errors in a tuple when it is created, either entered manually or generated by some process. That is, we want to ensure that a tuple t is clean before it is used, to prevent errors introduced by adding t. As noted by [SMO07], it is far less costly to correct a tuple at the point of entry than fixing it afterward. Data repairing based on integrity constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct.
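The editing-rule idea, using master data plus a certain region to produce a certain fix, can be shown with a small sketch; the rule, the region and the records below are invented for illustration.

```python
# A minimal sketch of an editing rule: if the attributes in the certain
# region (here: phone number) are asserted correct by the user and match a
# master tuple, copy the master values for the attributes the rule fixes.
# Rule, region and data are illustrative assumptions.

master = [
    {"phone": "0131-555-0101", "city": "Edinburgh", "zip": "EH8 9AB"},
    {"phone": "020-555-0202",  "city": "London",    "zip": "SW1A 1AA"},
]

CERTAIN_REGION = ("phone",)           # attributes the user asserts are correct
FIXES = ("city", "zip")               # attributes the rule is allowed to update

def certain_fix(tuple_in, master_rows):
    for m in master_rows:
        if all(tuple_in[a] == m[a] for a in CERTAIN_REGION):
            fixed = dict(tuple_in)
            for a in FIXES:
                fixed[a] = m[a]       # take the master value
            return fixed
    return tuple_in                   # no matching master tuple: leave unchanged

dirty = {"phone": "0131-555-0101", "city": "Edimburgh", "zip": "EH8"}
print(certain_fix(dirty, master))
# {'phone': '0131-555-0101', 'city': 'Edinburgh', 'zip': 'EH8 9AB'}
```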
APA, Harvard, Vancouver, ISO, and other styles
17

Angeles, Maria del Pilar. "Management of data quality when integrating data with known provenance." Thesis, Heriot-Watt University, 2007. http://hdl.handle.net/10399/64.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Diallo, Thierno Mahamoudou. "Discovering data quality rules in a master data management context." Thesis, Lyon, INSA, 2013. http://www.theses.fr/2013ISAL0067.

Full text
Abstract:
Dirty data continues to be an important issue for companies. The Data Warehouse Institute [Eckerson, 2002], [Rockwell, 2012] stated that poor data costs US businesses $611 billion annually and that erroneously priced data in retail databases costs US customers $2.5 billion each year. Data quality becomes more and more critical. The database community pays particular attention to this subject, and a variety of integrity constraints such as Conditional Functional Dependencies (CFDs) have been studied for data cleaning. Repair techniques based on these constraints are precise in catching inconsistencies but are limited in how to exactly correct the data. Master data brings a new alternative for data cleaning with respect to its quality. Thanks to the growing importance of Master Data Management (MDM), a new class of data quality rules known as Editing Rules (ERs) tells how to fix errors, pointing out which attributes are wrong and what values they should take. The intuition is to correct dirty data using high-quality data from the master. However, finding data quality rules is an expensive process that involves intensive manual effort, and it remains unrealistic to rely on human designers. In this thesis, we develop pattern mining techniques for discovering ERs from existing source relations with respect to master relations. In this setting, we propose a new semantics of ERs that takes advantage of both source and master data. Thanks to the proposed semantics, defined in terms of satisfaction, the discovery problem for ERs turns out to be strongly related to the discovery of both CFDs and one-to-one correspondences between source and target attributes. We first attack the problem of discovering CFDs, concentrating on the particular class of constant CFDs, known to be very expressive for detecting inconsistencies. We extend some well-known concepts introduced for traditional functional dependencies to solve the discovery problem for CFDs. Secondly, we propose a method based on inclusion dependencies to extract one-to-one correspondences from source to master attributes before automatically building ERs. Finally, we propose some heuristics for applying ERs to clean data. We have implemented and evaluated our techniques on both real-life and synthetic databases. Experiments show the feasibility, scalability and robustness of our proposal.
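A brief sketch of the inclusion-dependency test that suggests one-to-one correspondences between source and master attributes; the relations, values and coverage threshold below are illustrative assumptions.

```python
# A minimal sketch: a source attribute is proposed as corresponding to a
# master attribute when (almost) all of its values are included in the
# master attribute's values -- an approximate inclusion dependency.

source = {"cust_country": ["FR", "FR", "DE", "D3"],        # one dirty value
          "cust_name":    ["Alice", "Bob", "Carl", "Dina"]}
master = {"country_code": ["FR", "DE", "IT", "ES"],
          "full_name":    ["Alice Martin", "Bob Kahn"]}

def inclusion_ratio(src_values, ref_values):
    ref = set(ref_values)
    src = [v for v in src_values if v is not None]
    return sum(v in ref for v in src) / len(src)

for s_attr, s_vals in source.items():
    for m_attr, m_vals in master.items():
        ratio = inclusion_ratio(s_vals, m_vals)
        if ratio >= 0.75:             # tolerance for dirty source values
            print(f"{s_attr} -> {m_attr} (inclusion {ratio:.2f})")
# cust_country -> country_code (inclusion 0.75)
```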
APA, Harvard, Vancouver, ISO, and other styles
19

Gens, Rüdiger. "Quality assessment of SAR interferometric data." Hannover : Fachrichtung Vermessungswesen der Univ, 1998. http://deposit.ddb.de/cgi-bin/dokserv?idn=95607121X.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Berg, Marcus. "Evaluating Quality of Online Behavior Data." Thesis, Stockholms universitet, Statistiska institutionen, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-97524.

Full text
Abstract:
This thesis has two purposes: emphasizing the importance of data quality for Big Data, and identifying and evaluating potential error sources in JavaScript tracking (a client-side, on-site online behavior clickstream data collection method commonly used in web analytics). The importance of data quality for Big Data is emphasized through the evaluation of JavaScript tracking. The Total Survey Error framework is applied to JavaScript tracking, and 17 nonsampling error sources are identified and evaluated. The bias imposed by these error sources varies from large to small, but the major takeaway is the large number of error sources actually identified. More work is needed. Big Data has much to gain from quality work. Similarly, there is much that can be done with statistics in web analytics.
APA, Harvard, Vancouver, ISO, and other styles
21

Kim, Jin Mo. "Name matching for data quality mediator." Thesis, Massachusetts Institute of Technology, 1995. http://hdl.handle.net/1721.1/36588.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Viklund, Adam. "Data Quality Study of AMR Systems." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-269465.

Full text
Abstract:
Energy metering is a constantly changing field with increasing demands to obtain more measurement data. The implication is that systems are evolving and improving, and it is important for the data in these systems to be of high quality. This thesis set out to investigate data quality in the advanced meter reading (AMR) systems used by energy companies in Sweden today. In order to investigate data quality, a definition was suggested. The definition was used as a basis for interviewing users of AMR systems to understand the user experience of data quality and which features improve it. The interviews were conducted with six different users working at companies that distribute electricity and/or district heating to companies and consumers. The features improving data quality were then used to assess data quality in the open-source AMR system Gurux, and a redesign was proposed to improve data quality in Gurux. The data quality parameter that needed to be improved the most was data accessibility. The conclusion of this master's thesis is that, according to the perspectives given by the interviewees, there are many systems where data quality can be improved, and that Gurux is a system that can help improve data quality by making the changes suggested in this thesis.
APA, Harvard, Vancouver, ISO, and other styles
23

Aljumaili, Mustafa. "Data Quality Assessment : Applied in Maintenance." Doctoral thesis, Luleå tekniska universitet, Drift, underhåll och akustik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-26088.

Full text
Abstract:
Approved; 2016; 20160126 (musalj); The person named below will defend their thesis for the degree of Doctor of Technology. Name: Mustafa Aljumaili. Subject: Drift och underhållsteknik/Operation and Maintenance Engineering. Thesis: Data Quality Assessment: Applied in Maintenance. Opponent: Docent Mirka Kans, Institutionen för Maskinteknik, Linnéuniversitetet, Växjö. Chair: Professor Uday Kumar, Avdelning för Drift, underhåll och akustik, Institutionen för samhällsbyggnad och naturresurser, Luleå tekniska universitet. Time: Friday 4 March 2016, 10:00. Place: F1031, Luleå tekniska universitet.
APA, Harvard, Vancouver, ISO, and other styles
24

Edwards, Matthew. "Data quality measures for identity resolution." Thesis, Lancaster University, 2018. http://eprints.lancs.ac.uk/124402/.

Full text
Abstract:
The explosion in popularity of online social networks has led to increased interest in identity resolution from security practitioners. Being able to connect together the multiple online accounts of a user can be of use in verifying identity attributes and in tracking the activity of malicious users. At the same time, privacy researchers are exploring the same phenomenon with interest in identifying privacy risks caused by re-identification attacks. Existing literature has explored how particular components of an online identity may be used to connect profiles, but few if any studies have attempted to assess the comparative value of information attributes. In addition, few of the methods being reported are easily comparable, due to difficulties with obtaining and sharing ground-truth data. Attempts to gain a comprehensive understanding of the identifiability of profile attributes are hindered by these issues. With a focus on overcoming these hurdles to effective research, this thesis first develops a methodology for sampling ground-truth data from online social networks. Building on this with reference to both existing literature and samples of real profile data, this thesis describes and grounds a comprehensive matching schema of profile attributes. The work then defines data quality measures which are important for identity resolution, and measures the availability, consistency and uniqueness of the schema's contents. The developed measurements are then applied in a feature selection scheme to reduce the impact of missing data issues common in identity resolution. Finally, this thesis addresses the purposes to which identity resolution may be applied, defining the further application-oriented data quality measurements of novelty, veracity and relevance, and demonstrating their calculation and application for a particular use case: evaluating the social engineering vulnerability of an organisation.
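Two of the attribute-level measures discussed above, availability and uniqueness, can be sketched with toy data; the profiles and the exact formulas below are illustrative assumptions, not the definitions developed in the thesis.

```python
# A minimal sketch of attribute-level measures for identity resolution:
# availability (how often an attribute is filled in) and uniqueness (how
# discriminating its values are across users). Toy data, assumed formulas.

profiles = [
    {"user": "u1", "username": "jsmith",   "location": "UK"},
    {"user": "u2", "username": "jsmith92", "location": "UK"},
    {"user": "u3", "username": "ada_l",    "location": None},
]

def availability(profiles, attr):
    return sum(p.get(attr) is not None for p in profiles) / len(profiles)

def uniqueness(profiles, attr):
    values = [p.get(attr) for p in profiles if p.get(attr) is not None]
    return len(set(values)) / len(values) if values else 0.0

for attr in ("username", "location"):
    print(attr, round(availability(profiles, attr), 2),
          round(uniqueness(profiles, attr), 2))
# username 1.0 1.0   (always present, fully discriminating)
# location 0.67 0.5  (often missing, weakly discriminating)
```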
APA, Harvard, Vancouver, ISO, and other styles
25

Zhu, Zhaochen. "Computational methods in air quality data." HKBU Institutional Repository, 2017. https://repository.hkbu.edu.hk/etd_oa/402.

Full text
Abstract:
In this thesis, we have investigated several computational methods for data assimilation in air quality prediction, especially the sparse structure of the matrices involved and the underlying gradient information in the concentration of pollutant species. In the first part, we have studied the ensemble Kalman filter (EnKF) for chemical species simulation in air quality forecast data assimilation. The main contribution is to study sparse data observations and make use of the matrix structure of the Kalman filter update equations to design an algorithm that computes the analysis of chemical species in the air quality forecast system efficiently. The proposed method can also handle combined observations from multiple species together. We have applied the proposed method and tested its performance on real air quality data assimilation. Numerical examples have demonstrated the efficiency of the proposed computational method for the Kalman filter update, and the effectiveness of the proposed method for NO2, NO, CO, SO2, O3, PM2.5 and PM10 in air quality data assimilation. In the third part, we have set up an automatic workflow to connect the management system of the chemical transport model CMAQ with our proposed data assimilation methods. The setup has successfully integrated the data assimilation into the management system and shown that the accuracy of the prediction has risen to a new level. This technique has transformed the system into a real-time, high-precision system. When new observations are available, the predictions can be estimated almost instantaneously, so the agencies are able to make decisions and respond to situations immediately. In this way, citizens are able to protect themselves effectively. Meanwhile, it allows the mathematical algorithm to be industrialized, implying that improvements in data assimilation have directly positive effects on the environment, human health and society. Therefore, this has become an inspiring indication encouraging us to study and devote more research to this promising method.
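A compact sketch of a stochastic ensemble Kalman filter analysis step with a sparse observation operator, in the spirit of the update described above; the dimensions, the observation operator and the error statistics are illustrative assumptions, not the CMAQ configuration used in the thesis.

```python
# A minimal sketch of a stochastic EnKF analysis step. Each ensemble member
# is updated with perturbed observations; H selects the few observed grid
# cells (sparse observations). Numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_state, n_obs, n_ens = 50, 5, 20          # state size, observations, members

X = rng.normal(10.0, 2.0, size=(n_state, n_ens))        # forecast ensemble
H = np.zeros((n_obs, n_state))                           # sparse obs operator
H[np.arange(n_obs), [3, 11, 24, 37, 45]] = 1.0           # observed cells
R = 0.5 * np.eye(n_obs)                                  # obs error covariance
y = H @ rng.normal(10.0, 2.0, size=n_state) + rng.normal(0, 0.7, n_obs)

# Ensemble anomalies and forecast covariance terms.
Xm = X.mean(axis=1, keepdims=True)
A = X - Xm
PHt = A @ (H @ A).T / (n_ens - 1)                        # P_f H^T
S = H @ PHt + R                                          # H P_f H^T + R
K = PHt @ np.linalg.inv(S)                               # Kalman gain

# Update each member with perturbed observations.
Y = y[:, None] + rng.multivariate_normal(np.zeros(n_obs), R, n_ens).T
Xa = X + K @ (Y - H @ X)

print("forecast spread:", A.std().round(3),
      "analysis spread:", (Xa - Xa.mean(axis=1, keepdims=True)).std().round(3))
```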
APA, Harvard, Vancouver, ISO, and other styles
26

Nitesh, Varma Rudraraju Nitesh, and Boyanapally Varun Varun. "Data Quality Model for Machine Learning." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-18498.

Full text
Abstract:
Context: Machine learning is a part of artificial intelligence, and the area is growing continuously. Most internet-related services, such as social media, email spam filtering, e-commerce sites and search engines, now use machine learning. The quality of machine learning output relies on the input data, so good input data quality is crucial for a better outcome from a machine learning system. To achieve quality data, a data scientist can apply a data quality model to the data used for machine learning; such a model helps data scientists monitor and control the input data. However, little research has been done on data quality attributes and data quality models for machine learning. Objectives: The primary objectives of this thesis are to find and understand the state of the art and state of practice on data quality attributes for machine learning, and to develop a data quality model for machine learning in collaboration with data scientists. Methods: The work consists of two studies: 1) a literature review across different databases to identify literature on data quality attributes and data quality models for machine learning; 2) an in-depth interview study to better understand and verify the data quality attributes identified in the literature review, carried out in collaboration with data scientists from multiple locations. In total, 15 interviews were performed, and based on the results we proposed a data quality model reflecting the interviewees' perspective. Result: We identified 16 data quality attributes as important, based on the perspective of the experienced data scientists interviewed in this study. With these selected data quality attributes, we proposed a data quality model with which the quality of data for machine learning can be monitored and improved by data scientists; the effects of these data quality attributes on machine learning are also stated. Conclusion: This study signifies the importance of data quality, for which we proposed a data quality model for machine learning based on the industrial experience of data scientists. Addressing this research gap benefits all machine learning practitioners and data scientists who intend to identify quality data for machine learning. To confirm that the data quality attributes in the data quality model are important, a further experiment can be conducted, which is proposed as future work.
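As a rough, hypothetical illustration of how a few data quality attributes of the kind catalogued above might be monitored on machine learning input data, the sketch below scores completeness, uniqueness and validity for a small pandas DataFrame. The attribute set, column names and validity rules are assumptions, not the model proposed in the thesis.

```python
import pandas as pd

def dq_attribute_report(df, validity_rules):
    """Compute a few illustrative data quality attribute scores in [0, 1]."""
    report = {
        # completeness: share of non-missing cells
        "completeness": 1.0 - df.isna().to_numpy().mean(),
        # uniqueness: share of non-duplicated rows
        "uniqueness": 1.0 - df.duplicated().mean(),
    }
    # validity: share of non-missing values satisfying a per-column rule
    for col, rule in validity_rules.items():
        valid = df[col].dropna().map(rule)
        report[f"validity_{col}"] = valid.mean() if len(valid) else 1.0
    return report

df = pd.DataFrame({
    "age": [25, 31, None, 44, 230],            # 230 violates the range rule
    "income": [54000, 61000, 58000, None, 72000],
})
rules = {"age": lambda v: 0 <= v <= 120, "income": lambda v: v >= 0}
print(dq_attribute_report(df, rules))
```

A data scientist could track such scores over time and only feed batches above agreed thresholds into model training.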
APA, Harvard, Vancouver, ISO, and other styles
27

Issa, Subhi. "Linked data quality : completeness and conciseness." Electronic Thesis or Diss., Paris, CNAM, 2019. http://www.theses.fr/2019CNAM1274.

Full text
Abstract:
La large diffusion des technologies du Web Sémantique telles que le Resource Description Framework (RDF) permet aux individus de construire leurs bases de données sur le Web, d'écrire des vocabulaires et de définir des règles pour organiser et expliquer les relations entre les données selon les principes des données liées. En conséquence, une grande quantité de données structurées et interconnectées est générée quotidiennement. Un examen attentif de la qualité de ces données pourrait s'avérer très critique, surtout si d'importantes recherches et décisions professionnelles en dépendent. La qualité des données liées est un aspect important pour indiquer leur aptitude à être utilisées dans des applications. Plusieurs dimensions permettant d'évaluer la qualité des données liées sont identifiées, telles que la précision, la complétude, la provenance et la concision. Cette thèse se concentre sur l'évaluation de la complétude et l'amélioration de la concision des données liées. En particulier, nous avons d'abord proposé une approche de calcul de complétude fondée sur un schéma généré. En effet, comme un schéma de référence est nécessaire pour évaluer la complétude, nous avons proposé une approche fondée sur la fouille de données pour obtenir un schéma approprié (c.-à-d. un ensemble de propriétés) à partir des données. Cette approche permet de distinguer les propriétés essentielles des propriétés marginales pour générer, pour un ensemble de données, un schéma conceptuel qui répond aux attentes de l'utilisateur quant aux contraintes de complétude des données. Nous avons implémenté un prototype appelé "LOD-CM" pour illustrer le processus de dérivation d'un schéma conceptuel d'un ensemble de données fondé sur les besoins de l'utilisateur. Nous avons également proposé une approche pour découvrir des prédicats équivalents afin d'améliorer la concision des données liées. Cette approche s'appuie, en plus d'une analyse statistique, sur une analyse sémantique approfondie des données et sur des algorithmes d'apprentissage. Nous soutenons que l'étude de la signification des prédicats peut aider à améliorer l'exactitude des résultats. Enfin, un ensemble d'expériences a été mené sur des ensembles de données réelles afin d'évaluer les approches que nous proposons
The wide spread of Semantic Web technologies such as the Resource Description Framework (RDF) enables individuals to build their databases on the Web, to write vocabularies, and define rules to arrange and explain the relationships between data according to the Linked Data principles. As a consequence, a large amount of structured and interlinked data is being generated daily. A close examination of the quality of this data could be very critical, especially, if important research and professional decisions depend on it. The quality of Linked Data is an important aspect to indicate their fitness for use in applications. Several dimensions to assess the quality of Linked Data are identified such as accuracy, completeness, provenance, and conciseness. This thesis focuses on assessing completeness and enhancing conciseness of Linked Data. In particular, we first proposed a completeness calculation approach based on a generated schema. Indeed, as a reference schema is required to assess completeness, we proposed a mining-based approach to derive a suitable schema (i.e., a set of properties) from data. This approach distinguishes between essential properties and marginal ones to generate, for a given dataset, a conceptual schema that meets the user's expectations regarding data completeness constraints. We implemented a prototype called “LOD-CM” to illustrate the process of deriving a conceptual schema of a dataset based on the user's requirements. We further proposed an approach to discover equivalent predicates to improve the conciseness of Linked Data. This approach is based, in addition to a statistical analysis, on a deep semantic analysis of data and on learning algorithms. We argue that studying the meaning of predicates can help to improve the accuracy of results. Finally, a set of experiments was conducted on real-world datasets to evaluate our proposed approaches
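A minimal sketch of the schema-based completeness idea described above: mine a reference schema as the set of properties used by at least a chosen share of entities, then score each entity by the fraction of that schema it instantiates. The toy properties and the support threshold are assumptions for illustration, not the LOD-CM prototype itself.

```python
from collections import Counter

# Entity descriptions as property -> value maps (toy RDF-like data).
entities = {
    "ex:Film1": {"rdf:type": "ex:Film", "ex:director": "ex:P1", "ex:releaseYear": 1999},
    "ex:Film2": {"rdf:type": "ex:Film", "ex:director": "ex:P2"},
    "ex:Film3": {"rdf:type": "ex:Film", "ex:releaseYear": 2005, "ex:budget": 1_000_000},
}

def mine_schema(entities, min_support=0.6):
    """Keep properties used by at least `min_support` of the entities (the essential ones)."""
    counts = Counter(p for props in entities.values() for p in props)
    n = len(entities)
    return {p for p, c in counts.items() if c / n >= min_support}

def completeness(props, schema):
    """Fraction of the reference schema instantiated by one entity."""
    return len(schema & props.keys()) / len(schema)

schema = mine_schema(entities)   # here: {rdf:type, ex:director, ex:releaseYear}
for uri, props in entities.items():
    print(uri, round(completeness(props, schema), 2))
```

Raising or lowering the support threshold is one way of expressing the user's expectations about which properties count towards completeness.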
APA, Harvard, Vancouver, ISO, and other styles
28

Sehat, Mahdis, and Flores René Pavez. "Customer Data Management." Thesis, KTH, Industriell ekonomi och organisation (Avd.), 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-109251.

Full text
Abstract:
As business complexity and the number of customers continue to grow, and customers evolve into multinational organisations that operate across borders, many companies are faced with great challenges in the way they manage their customer data. In today’s business, a single customer may have a relationship with several entities of an organisation, which means that the customer data is collected through different channels. One customer may be described in different ways by each entity, which makes it difficult to obtain a unified view of the customer. In companies where there are several sources of data and the data is distributed to several systems, data environments become heterogeneous. In this state, customer data is often incomplete, inaccurate and inconsistent throughout the company. This thesis aims to study how organisations with heterogeneous customer data sources implement the Master Data Management (MDM) concept to achieve and maintain high customer data quality. The purpose is to provide recommendations for how to achieve successful customer data management using MDM based on existing literature related to the topic and an interview-based empirical study. Successful customer data management is more of an organisational issue than a technological one and requires a top-down approach in order to develop a common strategy for an organisation’s customer data management. Proper central assessment and maintenance processes that can be adjusted according to the entities’ needs must be in place. Responsibilities for the maintenance of customer data should be delegated to several levels of an organisation in order to better manage customer data.
APA, Harvard, Vancouver, ISO, and other styles
29

Landelius, Cecilia. "Data governance in big data : How to improve data quality in a decentralized organization." Thesis, KTH, Industriell ekonomi och organisation (Inst.), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301258.

Full text
Abstract:
The use of the internet has increased the amount of data available and gathered. Companies are investing in big data analytics to gain insights from this data. However, the value of the analysis, and of the decisions based on it, depends on the quality of the underlying data. For this reason, data quality has become a prevalent issue for organizations. Additionally, failures in data quality management are often due to organizational aspects. Due to the growing popularity of decentralized organizational structures, there is a need to understand how a decentralized organization can improve data quality. This thesis conducts a qualitative single case study of an organization in the logistics industry that is currently shifting towards becoming data driven and struggling to maintain data quality. The purpose of the thesis is to answer the questions: • RQ1: What is data quality in the context of logistics data? • RQ2: What are the obstacles for improving data quality in a decentralized organization? • RQ3: How can these obstacles be overcome? Several data quality dimensions were identified and categorized as critical issues, issues and non-issues. From the gathered data the dimensions completeness, accuracy and consistency were found to be critical issues of data quality. The three most prevalent obstacles for improving data quality were data ownership, data standardization and understanding the importance of data quality. To overcome these obstacles the most important measures are creating data ownership structures, implementing data quality practices and changing the mindset of the employees to a data driven mindset. The generalizability of a single case study is low. However, there are insights and trends which can be derived from the results of this thesis and used for further studies and companies undergoing similar transformations.
Den ökade användningen av internet har ökat mängden data som finns tillgänglig och mängden data som samlas in. Företag påbörjar därför initiativ för att analysera dessa stora mängder data för att få ökad förståelse. Dock är värdet av analysen samt besluten som baseras på analysen beroende av kvaliteten av den underliggande data. Av denna anledning har datakvalitet blivit en viktig fråga för företag. Misslyckanden i datakvalitetshantering är ofta på grund av organisatoriska aspekter. Eftersom decentraliserade organisationsformer blir alltmer populära, finns det ett behov av att förstå hur en decentraliserad organisation kan arbeta med frågor som datakvalitet och dess förbättring. Denna uppsats är en kvalitativ studie av ett företag inom logistikbranschen som i nuläget genomgår ett skifte till att bli datadrivna och som har problem med att underhålla sin datakvalitet. Syftet med denna uppsats är att besvara frågorna: • RQ1: Vad är datakvalitet i sammanhanget logistikdata? • RQ2: Vilka är hindren för att förbättra datakvalitet i en decentraliserad organisation? • RQ3: Hur kan dessa hinder överkommas? Flera datakvalitetsdimensioner identifierades och kategoriserades som kritiska problem, problem och icke-problem. Från den insamlade informationen fanns att dimensionerna, kompletthet, exakthet och konsekvens var kritiska datakvalitetsproblem för företaget. De tre mest förekommande hindren för att förbättra datakvalité var dataägandeskap, standardisering av data samt att förstå vikten av datakvalitet. För att överkomma dessa hinder är de viktigaste åtgärderna att skapa strukturer för dataägandeskap, att implementera praxis för hantering av datakvalitet samt att ändra attityden hos de anställda gentemot datakvalitet till en datadriven attityd. Generaliseringsbarheten av en enfallsstudie är låg. Dock medför denna studie flera viktiga insikter och trender vilka kan användas för framtida studier och för företag som genomgår liknande transformationer.
APA, Harvard, Vancouver, ISO, and other styles
30

Huang, Shiping. "Exploratory visualization of data with variable quality." Link to electronic thesis, 2005. http://www.wpi.edu/Pubs/ETD/Available/etd-01115-225546/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Dill, Robert W. "Data warehousing and data quality for a Spatial Decision Support System." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1997. http://handle.dtic.mil/100.2/ADA336886.

Full text
Abstract:
Thesis (M.S. in Information Technology Management)--Naval Postgraduate School, Sept. 1997.
Thesis advisors, Daniel R. Dolk, George W. Thomas, and Kathryn Kocher. Includes bibliographical references (p. 203-206). Also available online.
APA, Harvard, Vancouver, ISO, and other styles
32

Reinert, Olof, and Tobias Wiesinger. "DATA QUALITY CONSEQUENCES OF MANDATORY CYBER DATA SHARING BETWEEN DUOPOLY INSURERS." Thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-175180.

Full text
Abstract:
Cyber attacks against companies are becoming more common as technology advances and digitalization is increasing exponentially. All Swedish insurance companies that sell cyber insurance encounter the same problem: there is not enough data to do good actuarial work. In order for the pricing procedure to improve and general knowledge of cyber insurance to increase, it has been proposed that insurance companies should share their data with each other. The goal of the thesis is to do mathematical calculations to explore data quality consequences of such a sharing regime. This thesis is based on some important assumptions and three scenarios. The most important assumptions are that there are two insurance companies forced to share all their data with each other and that they can reduce the uncertainty about their own product by investing in better data quality. In the first scenario, we assume a game between two players where they can choose how much to invest in reducing the uncertainty. In the second scenario, we assume that there is not a game, but the two insurance companies are forced to make equal investments and thus have the same knowledge of their products. In the third scenario, we assume that the players are risk averse, that is, they are not willing to take high risk. The results will show how much, if any, the insurance companies should invest in the different scenarios to maximize their profits (if risk neutral) or utility (if risk averse). The results of this thesis show that in the first and second scenario, the optimal profit is reached when the insurance companies do not invest anything. In the third scenario though, the optimal investment is greater than zero, given that the companies are sufficiently risk averse.
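To illustrate the mechanics of the first, risk-neutral scenario, the toy sketch below searches a grid for a symmetric equilibrium investment level under an assumed stand-in payoff: a fixed premium income, an uncertainty cost that shrinks with the pooled investment of both insurers because data are shared, and a linear investment cost. The functional form and every parameter are hypothetical and are not the model analysed in the thesis.

```python
import numpy as np

# Assumed stand-in payoff parameters: premium income R, uncertainty scale a, unit investment cost c.
R, a, c = 10.0, 0.8, 1.0
grid = np.linspace(0.0, 5.0, 501)            # feasible investment levels

def profit(q_i, q_j):
    # Data are shared, so both investments reduce each insurer's residual uncertainty.
    return R - a / (1.0 + q_i + q_j) - c * q_i

def best_response(q_j):
    return grid[int(np.argmax(profit(grid, q_j)))]

# Symmetric equilibrium: an investment level that is (approximately) a best response to itself.
q_star = min(grid, key=lambda q: abs(best_response(q) - q))
print("equilibrium investment:", q_star, "profit:", round(profit(q_star, q_star), 2))
print("profit if both invested 1.0 instead:", round(profit(1.0, 1.0), 2))
```

With these assumed numbers the marginal benefit of extra investment never exceeds its cost once the data are pooled, so the search settles on zero investment, which is qualitatively in line with the corner solution the thesis reports for the risk-neutral scenarios.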
APA, Harvard, Vancouver, ISO, and other styles
33

Alkharboush, Nawaf Abdullah H. "A data mining approach to improve the automated quality of data." Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/65641/1/Nawaf%20Abdullah%20H_Alkharboush_Thesis.pdf.

Full text
Abstract:
This thesis describes the development of a robust and novel prototype to address the data quality problems that relate to the dimension of outlier data. It thoroughly investigates the associated problems with regards to detecting, assessing and determining the severity of the problem of outlier data; and proposes granule-mining based alternative techniques to significantly improve the effectiveness of mining and assessing outlier data.
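For orientation only, the sketch below applies a generic interquartile-range check to flag outlier values and score their severity; it stands in for the general idea of detecting and assessing outlier data and is not the granule-mining technique the thesis develops.

```python
import numpy as np
import pandas as pd

def iqr_outlier_report(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; severity = distance beyond the fence in IQR units."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    distance = np.maximum(lower - s, 0) + np.maximum(s - upper, 0)
    return pd.DataFrame({"value": s, "is_outlier": distance > 0, "severity": distance / iqr})

ages = [23, 27, 31, 29, 26, 30, 28, 25, 24, 240]    # 240 is a likely data entry error
print(iqr_outlier_report(ages).query("is_outlier"))
```

The severity column gives a simple handle for ranking suspected outliers before deciding whether to correct, exclude or keep them.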
APA, Harvard, Vancouver, ISO, and other styles
34

Spahiu, Blerina. "Profiling Linked Data." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2017. http://hdl.handle.net/10281/151645.

Full text
Abstract:
Nonostante l'elevato numero di dati pubblicati come LD, il loro utilizzo non ha ancora mostrato il loro potenziale per l’assenza di comprensione dei metadati. I consumatori di dati hanno bisogno di ottenere informazioni dai dataset in modo veloce e concentrato per poter decidere se sono utili per il loro problema oppure no. Le tecniche di profilazione dei dati offrono una soluzione efficace a questo problema in quanto sono utilizzati per generare metadati e statistiche che descrivono il contenuto dei dataset. Questa tesi presenta una ricerca, che affronta i problemi legati alla profilazione Linked Data. Nonostante il termine profilazione dei dati è usato in modo generico per diverse informazioni che descrivono i dataset, in questa tesi noi andiamo a ricoprire tre aspetti della profilazione; topic-based, schema-based e linkage-based. Il profilo proposto in questa tesi è fondamentale per il processo decisionale ed è la base dei requisiti che portano verso la comprensione dei dataset. In questa tesi presentiamo un approccio per classificare automaticamente insiemi di dati in una delle categorie utilizzate nel mondo dei LD. Inoltre, indaghiamo il problema della profilazione multi-topic. Per la profilazione schema-based proponiamo un approccio riassuntivo schema-based, che fornisce una panoramica sui rapporti nei dati. I nostri riassunti sono concisi e chiari sufficientemente per riassumere l'intero dataset. Inoltre, essi rivelano problemi di qualità e possono aiutare gli utenti nei compiti di formulazione dei query. Molti dataset nel LD cloud contengono informazioni simili per la stessa entità. Al fine di sfruttare appieno il suo potenziale LD bisogna far vedere questa informazione in modo esplicito. Profiling Linkage fornisce informazioni sul numero di entità equivalenti tra i dataset e rivela possibili errori.Le tecniche di profiling sviluppate durante questo lavoro sono automatiche e possono essere applicate a differenti insiemi di dati indipendentemente dal dominio.
Recently, the increasing diffusion of Linked Data (LD) as a standard way to publish and structure data on the Web has received growing attention from researchers and data publishers. LD adoption is reflected in different domains such as government, media, life science, etc., building a powerful Web available to anyone. Despite the high number of datasets published as LD, their potential is still not fully exploited as they lack comprehensive metadata. Data consumers need to obtain information about dataset content in a fast and summarized form to decide if it is useful for their use case at hand or not. Data profiling techniques offer an efficient solution to this problem as they are used to generate metadata and statistics that describe the content of the dataset. Existing profiling techniques do not cover a wide range of use cases. Many challenges due to the heterogeneous nature of Linked Data are still to be overcome. This thesis presents the doctoral research which tackles the problems related to Profiling Linked Data. Even though data profiling is an umbrella term for diverse descriptive information about a dataset, in this thesis we cover three aspects of profiling: topic-based, schema-based and linkage-based. The profile provided in this thesis is fundamental for the decision-making process and is the basic requirement towards understanding a dataset. In this thesis we present an approach to automatically classify datasets into one of the topical categories used in the LD cloud. Moreover, we investigate the problem of multi-topic profiling. For schema-based profiling we propose a schema-based summarization approach that provides an overview of the relations in the data. Our summaries are concise and informative enough to summarize the whole dataset. Moreover, they reveal quality issues and can help users in query formulation tasks. Many datasets in the LD cloud contain similar information for the same entity. To fully exploit its potential, LD should make this information explicit. Linkage profiling provides information about the number of equivalent entities between datasets and reveals possible errors. The profiling techniques developed during this work are automatic and can be applied to different datasets independently of the domain.
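As a small illustration of the schema-based side of profiling, the sketch below groups the subjects of a handful of toy triples by class and reports, for each class, the share of instances using each predicate. The data and vocabulary are assumptions, and the code is not the summarisation approach proposed in the thesis.

```python
from collections import defaultdict, Counter

# Toy triples (subject, predicate, object).
triples = [
    ("ex:Alice", "rdf:type", "foaf:Person"),
    ("ex:Alice", "foaf:name", "Alice"),
    ("ex:Alice", "foaf:knows", "ex:Bob"),
    ("ex:Bob", "rdf:type", "foaf:Person"),
    ("ex:Bob", "foaf:name", "Bob"),
    ("ex:Acme", "rdf:type", "org:Organization"),
    ("ex:Acme", "org:hasSite", "ex:Site1"),
]

# Group subjects by class, then count predicate usage per class.
types = defaultdict(set)
preds = defaultdict(list)
for s, p, o in triples:
    if p == "rdf:type":
        types[o].add(s)
    else:
        preds[s].append(p)

for cls, subjects in types.items():
    usage = Counter(p for s in subjects for p in preds[s])
    n = len(subjects)
    summary = {p: round(c / n, 2) for p, c in usage.items()}   # share of instances using p
    print(cls, summary)
```

Even such a crude per-class predicate profile already hints at quality issues, for example a predicate expected on every instance that only appears on half of them.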
APA, Harvard, Vancouver, ISO, and other styles
35

Cui, Qingguang. "Measuring data abstraction quality in multiresolution visualizations." Worcester, Mass. : Worcester Polytechnic Institute, 2007. http://www.wpi.edu/Pubs/ETD/Available/etd-041107-224152/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Mueller, G. "Data Consistency Checks on Flight Test Data." International Foundation for Telemetering, 2014. http://hdl.handle.net/10150/577405.

Full text
Abstract:
ITC/USA 2014 Conference Proceedings / The Fiftieth Annual International Telemetering Conference and Technical Exhibition / October 20-23, 2014 / Town and Country Resort & Convention Center, San Diego, CA
This paper reflects the principal results of a study performed internally by Airbus's flight test centers. The purpose of this study was to share the body of knowledge concerning data consistency checks between all Airbus business units. An analysis of the test process is followed by the identification of the process stakeholders involved in ensuring data consistency. In the main part of the paper several different possibilities for improving data consistency are listed; it is left to the discretion of the reader to determine the appropriateness of these methods.
APA, Harvard, Vancouver, ISO, and other styles
37

Schmidt, Sven. "Quality of service aware data stream processing." Doctoral thesis, [S.l.] : [s.n.], 2007. http://deposit.ddb.de/cgi-bin/dokserv?idn=983780625.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Tardif, Geneviève. "Multivariate Analysis of Canadian Water Quality Data." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32245.

Full text
Abstract:
Physical-chemical water quality data from lotic water monitoring sites across Canada were integrated into one dataset. Two overlapping matrices of data were analyzed with principal component analysis (PCA) and cluster analysis to uncover structure and patterns in the data. The first matrix (Matrix A) had 107 sites located throughout Canada, and the following water quality parameters: pH, specific conductance (SC), and total phosphorus (TP). The second matrix (Matrix B) included more variables: calcium (Ca), chloride (Cl), total alkalinity (T_ALK), dissolved oxygen (DO), water temperature (WT), pH, SC and TP, for a subset of 42 sites. Landscape characteristics were calculated for each water quality monitoring site and their importance in explaining water quality data was examined through redundancy analysis. The first principal components in the analyses of Matrix A and B were most correlated with SC, suggesting this parameter is the most representative of water quality variance at the scale of Canada. Overlaying cluster analysis results on PCA information proved an excellent means of identifying the major water characteristics defining each group; mapping cluster analysis group membership provided information on their spatial distribution and was found informative with regard to the probable environmental influences on each group. Redundancy analyses produced significant predictive models of water quality demonstrating that landscape characteristics are determinant factors in water quality at the country scale. The proportion of cropland and the mean annual total precipitation in the drainage area were the landscape variables with the most variance explained. Assembling a consistent dataset of water quality data from monitoring locations throughout Canada proved difficult due to the unevenness of the monitoring programs in place. It is therefore recommended that a standard for the monitoring of a minimum core set of water quality variables be implemented throughout the country to support future nation-wide analysis of water quality data.
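The pipeline described (standardise, run PCA, cluster, inspect loadings) can be sketched as follows. The numbers are synthetic stand-ins for the pH, specific conductance and total phosphorus values of Matrix A, and the log-transform, number of components and number of clusters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in for Matrix A: pH, SC and TP at 107 sites.
sites = pd.DataFrame({
    "pH": rng.normal(7.5, 0.5, 107),
    "SC": rng.lognormal(5.0, 0.8, 107),     # uS/cm
    "TP": rng.lognormal(-3.0, 0.9, 107),    # mg/L
})

Xdf = sites.copy()
Xdf[["SC", "TP"]] = np.log10(Xdf[["SC", "TP"]])    # log-transform the skewed variables
X = StandardScaler().fit_transform(Xdf)

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)

print("variance explained:", np.round(pca.explained_variance_ratio_, 2))
print(pd.DataFrame(pca.components_, columns=Xdf.columns, index=["PC1", "PC2"]).round(2))
print("sites per cluster:", np.bincount(labels))
```

Plotting the cluster labels on the PCA scores, or on a map, is the visual step the abstract describes for characterising and locating each group of sites.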
APA, Harvard, Vancouver, ISO, and other styles
39

Smith, Sonya K. "Assessing the quality of deep seismic data." Thesis, University of Cambridge, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.361690.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Karamancı, Kaan. "Exploratory data analysis for preemptive quality control." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/53126.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.
Includes bibliographical references (p. 113).
In this thesis, I proposed and implemented a methodology to perform preemptive quality control on low-tech industrial processes with abundant process data. This involves a four-stage process which includes understanding the process, interpreting and linking the available process parameter and quality control data, developing an exploratory data toolset and presenting the findings in a visual and easily implementable fashion. In particular, the exploratory data techniques used rely on visual human pattern recognition through data projection and machine learning techniques for clustering. The presentation of findings is achieved via software that visualizes high-dimensional data with Chernoff faces. Performance is tested on both simulated and real industry data. The data obtained from a company was not suitable, but suggestions on how to collect suitable data were given.
by Kaan Karamancı.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
41

Schnetzer, Matthias, Franz Astleithner, Predrag Cetkovic, Stefan Humer, Manuela Lenk, and Mathias Moser. "Quality Assessment of Imputations in Administrative Data." De Gruyter, 2015. http://dx.doi.org/10.1515/JOS-2015-0015.

Full text
Abstract:
This article contributes a framework for the quality assessment of imputations within a broader structure to evaluate the quality of register-based data. Four quality-related hyperdimensions examine the data processing from the raw-data level to the final statistics. Our focus lies on the quality assessment of different imputation steps and their influence on overall data quality. We suggest classification rates as a measure of accuracy of imputation and derive several computational approaches. (authors' abstract)
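In the spirit of the accuracy measure described, the snippet below computes a classification rate as the share of imputed categorical values that agree with withheld true values; the variable, the naive modal-category imputer and the data are illustrative assumptions rather than the article's actual imputation steps.

```python
import pandas as pd

def classification_rate(true_values, imputed_values):
    """Share of imputed categorical values that agree with the (withheld) true values."""
    true_values = pd.Series(true_values)
    imputed_values = pd.Series(imputed_values)
    return (true_values == imputed_values).mean()

# Withhold known values, impute with the register's modal category, then compare.
true_occupation = pd.Series(["employed", "employed", "student", "retired", "employed"])
mode = true_occupation.mode()[0]
imputed = pd.Series([mode] * len(true_occupation))   # naive donor: modal category
print("classification rate:", classification_rate(true_occupation, imputed))   # 0.6
```

Computing such rates separately for each imputation step makes it possible to see where in the processing chain accuracy is lost, which is the point of the hyperdimension framework.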
APA, Harvard, Vancouver, ISO, and other styles
42

Law, Eugene L. "CORRELATION BETWEEN TAPE DROPOUTS AND DATA QUALITY." International Foundation for Telemetering, 1990. http://hdl.handle.net/10150/613460.

Full text
Abstract:
International Telemetering Conference Proceedings / October 29-November 02, 1990 / Riviera Hotel and Convention Center, Las Vegas, Nevada
This paper will present the results of a study to correlate tape dropouts and data quality. A tape dropout is defined in the Telemetry Standards as “a reproduced signal of abnormally low amplitude caused by tape imperfections severe enough to produce a data error”. Bit errors were chosen as the measure of data quality. Signals were recorded on several tracks of a wideband analog instrumentation magnetic tape recorder. The tape tracks were 50 mils wide. The signal characteristics were analyzed when bit errors or low reproduce amplitudes were detected.
APA, Harvard, Vancouver, ISO, and other styles
43

Nilsson, Petter. "Improving Data Quality in Swedbank Swedish DataWarehouse." Thesis, Umeå universitet, Institutionen för datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-128236.

Full text
Abstract:
Poor data quality is very much present in large organizational databases worldwide, such as data warehouses (DWs). The purpose of this thesis is to propose improvements in how data quality issues are handled in the Swedbank Swedish Data Warehouse. The thesis starts with a general explanation of what a DW is and what usage areas it has. After that, the concept of data quality and its dimensions is examined. Typical issues with data quality that may arise and how to solve them are covered. Swedbank’s architecture is described, along with the problems they face, and how they try to handle them as of now. Finally, improvement areas in their methods are highlighted, and general strategies, based on existing research, for avoiding data quality issues are proposed. Hopefully, the propositions given in this thesis can be of use to others facing similar issues.
APA, Harvard, Vancouver, ISO, and other styles
44

Veiga, Allan Koch. "A conceptual framework on biodiversity data quality." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/3/3141/tde-17032017-085248/.

Full text
Abstract:
The increasing availability of digitized biodiversity data worldwide, provided by an increasing number of sources, and the growing use of those data for a variety of purposes have raised concerns related to the "fitness for use" of such data and the impact of data quality (DQ) on outcomes of analyses, reports and decision making. A consistent approach to assess and manage DQ is currently critical for biodiversity data users. However, achieving this goal has been particularly challenging because of the idiosyncrasies inherent to the concept of quality. DQ assessment and management cannot be suitably carried out if we have not clearly established the meaning of quality according to the data user's standpoint. This thesis presents a formal conceptual framework to support the Biodiversity Informatics (BI) community to consistently describe the meaning of data "fitness for use". Principles behind data fitness for use are used to establish a formal and common ground for the collaborative definition of DQ needs, solutions and reports useful for DQ assessment and management. Based on the study of the DQ domain and its contextualization in the BI domain, which involved discussions with experts in DQ and BI in an iterative process, a comprehensive framework was designed and formalized. The framework defines eight fundamental concepts and 21 derived concepts, organized into three classes: DQ Needs, DQ Solutions and DQ Report. The concepts of each class describe, respectively, the meaning of DQ in a given context, the methods and tools that can serve as solutions for meeting DQ needs, and reports that present the current status of quality of a data resource. The formalization of the framework was presented using conceptual maps notation and set theory notation. In order to validate the framework, we present a proof of concept based on a case study conducted at the Museum of Comparative Zoology of Harvard University. The tools FP-Akka Kurator and the BDQ Toolkit were used in the case study to perform DQ measures, validations and improvements in a dataset of the Arizona State University Hasbrouck Insect Collection. The results illustrate how the framework enables data users to assess and manage DQ of datasets and single records using quality control and quality assurance approaches. The proof of concept has also shown that the framework is adequately formalized and flexible, and sufficiently complete for defining DQ needs, solutions and reports in the BI domain. The framework is able to formalize human thinking into well-defined components, making it possible to share and reuse definitions of DQ in different scenarios, to describe and find DQ tools and services, and to communicate the current status of data quality in a standardized format among the stakeholders. In addition, the framework supports the players of that community in joining efforts on the collaborative gathering and development of the necessary components for DQ assessment and management in different contexts. The framework is also the foundation of a Task Group on Data Quality, under the auspices of the Biodiversity Information Standards (TDWG) and the Global Biodiversity Information Facility (GBIF), and is being used, initially, to help collect users' needs on data quality in agrobiodiversity and in species distribution modelling.
In future work, we plan to use the framework to engage the BI community to formalize and share DQ profiles related to a number of other data usages, to recommend methods, guidelines, protocols, metadata schemas and controlled vocabulary for supporting data fitness for use assessment and management in distributed system and data environments. In addition, we plan to build a platform based on the framework to serve as a common backbone for registering and retrieving DQ concepts, such as DQ profiles, methods, tools and reports.
A crescente disponibilização de dados digitalizados sobre a biodiversidade em todo o mundo, fornecidos por um crescente número de fontes, e o aumento da utilização desses dados para uma variedade de propósitos, tem gerado preocupações relacionadas a \"adequação ao uso\" desses dados e ao impacto da qualidade de dados (QD) sobre resultados de análises, relatórios e tomada de decisões. Uma abordagem consistente para avaliar e gerenciar a QD é atualmente crítica para usuários de dados sobre a biodiversidade. No entanto, atingir esse objetivo tem sido particularmente desafiador devido à idiossincrasia inerente ao conceito de qualidade. A avaliação e a gestão da QD não podem ser adequadamente realizadas sem definir claramente o significado de qualidade de acordo com o ponto de vista do usuário dos dados. Esta tese apresenta um arcabouço conceitual formal para apoiar a comunidade de Informática para Biodiversidade (IB) a descrever consistentemente o significado de \"adequação ao uso\" de dados. Princípios relacionados à adequação ao uso são usados para estabelecer uma base formal e comum para a definição colaborativa de necessidades, soluções e relatórios de QD úteis para a avaliação e gestão de QD. Baseado no estudo do domínio de QD e sua contextualização no domínio de IB, que envolveu discussões com especialistas em QD e IB em um processo iterativo, foi projetado e formalizado um arcabouço conceitual abrangente. Ele define oito conceitos fundamentais e vinte e um conceitos derivados organizados em três classes: Necessidades de QD, Soluções de QD e Relatório de QD. Os conceitos de cada classe descrevem, respectivamente, o significado de QD em um dado contexto, métodos e ferramentas que podem servir como soluções para atender necessidades de QD, e relatórios que apresentam o estado atual da qualidade de um recurso de dado. A formalização do arcabouço foi apresentada usando notação de mapas conceituais e notação de teoria dos conjuntos. Para a validação do arcabouço, nós apresentamos uma prova de conceito baseada em um estudo de caso conduzido no Museu de Zoologia Comparativa da Universidade de Harvard. As ferramentas FP-Akka Kurator e BDQ Toolkit foram usadas no estudo de caso para realizar medidas, validações e melhorias da QD em um conjunto de dados da Coleção de Insetos Hasbrouck da Universidade do Estado do Arizona. Os resultados ilustram como o arcabouço permite a usuários de dados avaliarem e gerenciarem a QD de conjunto de dados e registros isolados usando as abordagens de controle de qualidade a garantia de qualidade. A prova de conceito demonstrou que o arcabouço é adequadamente formalizado e flexível, e suficientemente completo para definir necessidades, soluções e relatórios de QD no domínio da IB. O arcabouço é capaz de formalizar o pensamento humano em componentes bem definidos para fazer possível compartilhar e reutilizar definições de QD em diferentes cenários, descrever e encontrar ferramentas de QD e comunicar o estado atual da qualidade dos dados em um formato padronizado entre as partes interessadas da comunidade de IB. Além disso, o arcabouço apoia atores da comunidade de IB a unirem esforços na identificação e desenvolvimento colaborativo de componentes necessários para a avaliação e gestão da QD. 
O arcabouço é também o fundamento de um Grupos de Trabalho em Qualidade de Dados, sob os auspícios do Biodiversity Information Standard (TDWG) e do Biodiversity Information Facility (GBIF) e está sendo utilizado para coletar as necessidades de qualidade de dados de usuários de dados de agrobiodiversidade e de modelagem de distribuição de espécies, inicialmente. Em trabalhos futuros, planejamos usar o arcabouço apresentado para engajar a comunidade de IB para formalizar e compartilhar perfis de QD relacionados a inúmeros outros usos de dados, recomendar métodos, diretrizes, protocolos, esquemas de metadados e vocabulários controlados para apoiar a avaliação e gestão da adequação ao uso de dados em ambiente de sistemas e dados distribuídos. Além disso, nós planejamos construir uma plataforma baseada no arcabouço para servir como uma central integrada comum para o registro e recuperação de conceitos de QD, tais como perfis, métodos, ferramentas e relatórios de QD.
APA, Harvard, Vancouver, ISO, and other styles
45

Schmidt, Sven. "Quality-of-Service-Aware Data Stream Processing." Doctoral thesis, Technische Universität Dresden, 2006. https://tud.qucosa.de/id/qucosa%3A23955.

Full text
Abstract:
Data stream processing in the industrial as well as in the academic field has gained more and more importance during the last years. Consider the monitoring of industrial processes as an example. There, sensors are mounted to gather lots of data within a short time range. Storing and post-processing these data may occasionally be useless or even impossible. On the one hand, only a small part of the monitored data is relevant. To efficiently use the storage capacity, only a preselection of the data should be considered. On the other hand, it may occur that the volume of incoming data is generally too high to be stored in time or–in other words–the technical efforts for storing the data in time would be out of scale. Processing data streams in the context of this thesis means to apply database operations to the stream in an on-the-fly manner (without explicitly storing the data). The challenges for this task lie in the limited amount of resources while data streams are potentially infinite. Furthermore, data stream processing must be fast and the results have to be disseminated as soon as possible. This thesis focuses on the latter issue. The goal is to provide a so-called Quality-of-Service (QoS) for the data stream processing task. Therefore, adequate QoS metrics like maximum output delay or minimum result data rate are defined. Thereafter, a cost model for obtaining the required processing resources from the specified QoS is presented. On that basis, the stream processing operations are scheduled. Depending on the required QoS and on the available resources, the weight can be shifted among the individual resources and QoS metrics, respectively. Calculating and scheduling resources requires a lot of expert knowledge regarding the characteristics of the stream operations and regarding the incoming data streams. Often, this knowledge is based on experience and thus, a revision of the resource calculation and reservation becomes necessary from time to time. This leads to occasional interruptions of the continuous data stream processing, of the delivery of the result, and thus, of the negotiated Quality-of-Service. The proposed robustness concept supports the user and facilitates a decrease in the number of interruptions by providing more resources.
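To give a flavour of the QoS metrics mentioned (maximum output delay, minimum result data rate), the sketch below checks a window of processed stream tuples against a negotiated contract; the class, field names and thresholds are assumptions, not the cost model or scheduler developed in the thesis.

```python
from dataclasses import dataclass

@dataclass
class QoSContract:
    max_output_delay_ms: float   # negotiated upper bound on per-tuple processing delay
    min_result_rate_hz: float    # negotiated lower bound on result output rate

def qos_satisfied(arrival_ms, output_ms, contract, window_s=1.0):
    """Check a window of tuples: worst-case delay and achieved result rate vs. the contract."""
    delays = [out - arr for arr, out in zip(arrival_ms, output_ms)]
    span_s = max(window_s, (max(output_ms) - min(output_ms)) / 1000.0)
    rate = len(output_ms) / span_s
    return max(delays) <= contract.max_output_delay_ms and rate >= contract.min_result_rate_hz

contract = QoSContract(max_output_delay_ms=50.0, min_result_rate_hz=100.0)
arrivals = [i * 5.0 for i in range(200)]            # a tuple every 5 ms
outputs = [t + 12.0 for t in arrivals]              # constant 12 ms processing delay
print(qos_satisfied(arrivals, outputs, contract))   # True: 12 ms <= 50 ms and ~200 Hz >= 100 Hz
```

In a running system such a check would feed back into resource reservation, which is where the thesis's cost model and robustness concept come in.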
APA, Harvard, Vancouver, ISO, and other styles
46

Hillhouse, Linden, and Ginette Blackhart. "Data Quality: Does Time of Semester Matter?" Digital Commons @ East Tennessee State University, 2019. https://dc.etsu.edu/asrf/2019/schedule/84.

Full text
Abstract:
When conducting scientific research, obtaining high-quality data is important. When collecting data from a college student participant pool, however, factors such as the time of the semester in which data are collected could cause validity issues, especially if the survey is completed in an online, non-laboratory setting. Near the end of the semester, students may experience more time pressures and constraints than at other times in the semester. These additional pressures may encourage participants to multi-task while completing the study, or to rush through the survey in order to receive credits as quickly as possible. The hypothesis of this study was that responses collected at the end of the semester would exhibit lower data quality than responses collected at the beginning of the semester. Data were collected online during the last two weeks of the fall 2018 semester (n = 312) and the first two weeks of the spring 2019 semester (n = 55). Participants were asked to write about an embarrassing situation and then completed a number of questionnaires assessing their thoughts and feelings about the event, personality traits, and participant engagement. Data quality was assessed using several different previously validated methods, including time spent on survey; the number of missed items; the number of incorrect embedded attention-check items (out of 12); the length of responses on two open-ended questions; self-reported diligence, interest, effort, attention, and whether their data should be used; and Cronbach’s alphas on the scales. Results showed that between the two groups, there were significant differences on length of open-ended responses, self-reported diligence, self-reported interest, effort, attention, neuroticism, and conscientiousness. Participants completing the study in the first two weeks of the spring 2019 semester had significantly longer open-ended responses and significantly higher levels of self-reported diligence, self-reported interest, effort, attention, neuroticism, and conscientiousness. Although there was not a significant difference in number of incorrect attention-check items between the two groups, it should be noted that only 46% of the total participants did not miss any check items. These results lend support to the hypothesis that data collected at the end of the semester may be of lower quality than data collected at the beginning of the semester. However, because the groups significantly differed on neuroticism and conscientiousness, we cannot determine whether the time of semester effect is a product of internal participant characteristics or external pressures. Nevertheless, researchers should take into account this end-of-semester data quality difference when deciding the time-frame of their data collection.
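A sketch of the kind of screening the abstract describes, flagging respondents by completion time, failed attention checks and open-ended response length, and computing Cronbach's alpha for a scale, is shown below; the column names, cut-offs and data are assumptions rather than the study's actual criteria.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of scale items (rows = respondents)."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

responses = pd.DataFrame({
    "minutes_spent":  [22, 4, 18, 25, 3],
    "failed_checks":  [0, 5, 1, 0, 7],      # out of 12 embedded attention checks
    "open_ended_len": [140, 9, 95, 210, 4], # characters written about the event
})
flags = (
    (responses["minutes_spent"] < 8)
    | (responses["failed_checks"] > 2)
    | (responses["open_ended_len"] < 20)
)
print("respondents flagged for low-quality data:", responses.index[flags].tolist())

scale = pd.DataFrame(np.random.default_rng(2).integers(1, 6, size=(100, 8)))
print("alpha:", round(cronbach_alpha(scale), 2))   # close to zero for random data, higher for a coherent scale
```

Running such screening separately for early- and late-semester samples is one way to quantify the difference the study reports.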
APA, Harvard, Vancouver, ISO, and other styles
47

Silberbauer, Michael John. "Methods for visualising complex water quality data." Doctoral thesis, University of Cape Town, 2009. http://hdl.handle.net/11427/12148.

Full text
Abstract:
Includes abstract.
Includes bibliographical references (leaves 157-173).
The quality of South Africa’s over-stretched water resources is a matter of concern for all who depend on them for their survival and prosperity, so access to the relevant monitoring data is essential. Visualisation is a powerful method for analysing these data and communicating the results, because it unloads complex cognitive processes from the fairly restricted human numerical processing structures onto the highly developed visual perception system. Developments in the field of visualisation during the past two decades have yielded many practical methods that are applicable to the analysis and presentation of water quality data. Judicious use of visualisation aids aquatic scientists, water resource managers and ordinary consumers in assessing the quality of their water and deciding on remedial measures. To provide some insight into the possibilities of visualisation techniques, I analyse and discuss five visual methods that I have developed or contributed to: multivariate time-series inventory plots; multivariate map symbols; spatially-referenced inventory of water quality data; mass transfer summary plots; and the use of visual methods in communicating the ecological status of rivers to a wide audience.
APA, Harvard, Vancouver, ISO, and other styles
48

Rula, Anisa. "Time-related quality dimensions in linked data." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2014. http://hdl.handle.net/10281/81717.

Full text
Abstract:
Over the last few years, there has been an increasing diffusion of Linked Data as a standard way to publish interlinked structured data on the Web, which allows users and public and private organizations to fully exploit a large amount of data from several domains that were not available in the past. Although gathering and publishing such a massive amount of structured data is certainly a step in the right direction, quality still poses a significant obstacle to the uptake of data consumption applications at large scale. A crucial aspect of quality regards the dynamic nature of Linked Data, where information can change rapidly and fail to reflect changes in the real world, thus becoming out of date. Quality is characterised by different dimensions that capture several aspects of quality such as accuracy, currency, consistency or completeness. In particular, the aspects of Linked Data dynamicity are captured by Time-Related Quality Dimensions such as data currency. The assessment of Time-Related Quality Dimensions, which is the task of measuring the quality, is based on temporal information whose collection poses several challenges regarding its availability, representation and diversity in Linked Data. The assessment of Time-Related Quality Dimensions supports data consumers in deciding whether information is valid or not. The main goal of this thesis is to develop techniques for assessing Time-Related Quality Dimensions in Linked Data, which must overcome several challenges posed by Linked Data such as third-party applications, variety of data, high volume of data or velocity of data. The major contributions of this thesis can be summarized as follows: it presents a general set of definitions for quality dimensions and measures adopted in Linked Data; it provides a large-scale analysis of approaches for representing temporal information in Linked Data; it provides a sharable and interoperable conceptual model which integrates vocabularies used to represent temporal information required for the assessment of Time-Related Quality Dimensions; it proposes two domain-independent techniques to assess data currency that work with incomplete or inaccurate temporal information; and, finally, it provides an approach that enriches information with time intervals representing its temporal validity.
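One common way to score data currency, normalising the age of a fact's last modification against an assumed volatility window, is sketched below. The formula, the choice of dcterms:modified as the temporal annotation and the window length are illustrative assumptions, not necessarily the exact measures defined in the thesis.

```python
from datetime import datetime, timezone

def currency(last_modified: datetime, observation_time: datetime, volatility_days: float) -> float:
    """Currency in [0, 1]: 1 = just updated, 0 = older than the assumed volatility window."""
    age_days = (observation_time - last_modified).total_seconds() / 86400.0
    return max(0.0, 1.0 - age_days / volatility_days)

# A fact annotated (e.g. via dcterms:modified) and assessed against a 365-day volatility window.
last_modified = datetime(2013, 6, 1, tzinfo=timezone.utc)
now = datetime(2014, 2, 1, tzinfo=timezone.utc)
print(round(currency(last_modified, now, volatility_days=365), 2))   # about 0.33
```

The hard part in practice is the one the thesis addresses: obtaining a reliable last-modified annotation at all, given how unevenly temporal information is represented across datasets.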
APA, Harvard, Vancouver, ISO, and other styles
49

Pozzoli, Alice. "Data and quality metadata for continuous fields." Lyon, INSA, 2008. http://theses.insa-lyon.fr/publication/2008ISAL0024/these.pdf.

Full text
Abstract:
This thesis deals with data processing in Geomatics, ranging from data acquisition in Photogrammetry to data representation in Cartography. The objective of this research was to use statistical techniques of data processing for the creation of digital surface models starting from photogrammetric images. The main function of photogrammetry is the transformation of data coming from the image space to the object space. An easy solution for three-image orientation is proposed. The orientation procedure described has some relevant advantages for environmental and monitoring applications, which make it a very powerful tool alongside more traditional methodologies. Among many different applications, an interesting project for the survey of a hydraulic 3D model of a stream confluence in a mountain area has been performed. From a computing point of view, we propose a description of the photogrammetric data based on the XML format for geographic data (Geographic Markup Language). The aim is to optimize the archiving and management of geo-data. As a conclusion, an original software product which allows terrain modelling starting from three-image photogrammetry has been developed and tested.
Le sujet principal de ma thèse est le traitement des données en géomatique allant de l’acquisition des données photogrammétriques à la représentation cartographique. L’objectif de ma recherche est ainsi l’utilisation des techniques statistiques pour le traitement des données géomatiques afin de créer des modèles numériques des terrains en partant des données photogrammetriques. La fonction principale de la Photogrammétrie est la transformation des données en partant de l’espace-image à l’espace-objet. Nous avons proposé une solution pratique pour l’orientation automatique à partir de trois images. Cette méthodologie d’orientation présente de nombreux avantages pour les applications environnementales et de surveillance, et elle est un puissant instrument que l’on peut utiliser à côté de méthodologies plus traditionnelles. Parmi diverses applications possibles, on a choisi de construire le relief d’un modèle hydraulique 3D qui représente la confluence de deux torrents dans une région montagneuse. D’un point de vue informatique, nous avons proposé une description de données photogrammétriques basée sur le format XML pour les données géographiques (extension de GML, Geographic Markup Language). L’objectif est d’optimiser l’archivage et la gestion des données géomatiques. Enfin, un logiciel original a été produit, qui permet de modéliser les terrains en utilisant la photogrammétrie à trois images
APA, Harvard, Vancouver, ISO, and other styles
50

Mehanna, Souheir. "Data quality issues in mobile crowdsensing environments." Electronic Thesis or Diss., université Paris-Saclay, 2023. http://www.theses.fr/2023UPASG053.

Full text
Abstract:
Les environnements de capteurs mobiles sont devenus le paradigme de référence pour exploiter les capacités de collecte des appareils mobiles et recueillir des données variées en conditions réelles. Pour autant, garantir la qualité des données recueillies reste une tâche complexe car les capteurs, souvent à bas coûts et ne fonctionnant pas toujours de façon optimale, peuvent être sujets à des dysfonctionnements, des erreurs, voire des pannes. Comme la qualité des données a un impact direct et significatif sur les résultats des analyses ultérieures, il est crucial de l'évaluer. Dans notre travail, nous nous intéressons à deux problématiques majeures liées à la qualité des données recueillies par les environnements de capteurs mobiles.Nous nous intéressons en premier à la complétude des données et nous proposons un ensemble de facteurs de qualité adapté à ce contexte, ainsi que des métriques permettant de les évaluer. En effet, les facteurs et métriques existants ne capturent pas l'ensemble des caractéristiques associées à la collecte de données par des capteurs. Afin d'améliorer la complétude des données, nous nous sommes intéressés au problème de génération des données manquantes. Les techniques actuelles d'imputation de données génèrent les données manquantes en se reposant sur les données existantes, c'est à dire les mesures déjà réalisées par les capteurs, sans tenir compte de la qualité de ces données qui peut être très variable. Nous proposons donc une approche qui étend les techniques existantes pour permettre la prise en compte de la qualité des données pendant l'imputation. La deuxième partie de nos travaux est consacrée à la détection d'anomalies dans les données de capteurs. Tout comme pour l'imputation de données, les techniques permettant de détecter des anomalies utilisent des métriques sur les données mais ignorent la qualité des ces dernières. Pour améliorer la détection, nous proposons une approche fondés sur des algorithmes de clustering qui intègrent la qualité des capteurs dans le processus de détection des anomalies.Enfin, nous nous sommes intéressés à la façon dont la qualité des données pourrait être prise en compte lors de l'analyse de données issues de capteurs. Nous proposons deux contributions préliminaires: des opérateurs d'agrégation qui considère la qualité des mesures, et une approche pour évaluer la qualité d'un agrégat en fonction des données utilisées dans son calcul
Mobile crowdsensing has emerged as a powerful paradigm for harnessing the collective sensing capabilities of mobile devices to gather diverse data in real-world settings. However, ensuring the quality of the collected data in mobile crowdsensing environments (MCS) remains a challenge because low-cost nomadic sensors can be prone to malfunctions, faults, and points of failure. The quality of the collected data can significantly impact the results of the subsequent analyses. Therefore, monitoring the quality of sensor data is crucial for effective analytics. In this thesis, we have addressed some of the issues related to data quality in mobile crowdsensing environments. First, we have explored issues related to data completeness. The mobile crowdsensing context has specific characteristics that are not all captured by the existing factors and metrics. We have proposed a set of quality factors of data completeness suitable for mobile crowdsensing environments. We have also proposed a set of metrics to evaluate each of these factors. In order to improve data completeness, we have tackled the problem of generating missing values. Existing data imputation techniques generate missing values by relying on existing measurements without considering the disparate quality levels of these measurements. We propose a quality-aware data imputation approach that extends existing data imputation techniques by taking into account the quality of the measurements. In the second part of our work, we have focused on anomaly detection, which is another major problem that sensor data face. Existing anomaly detection approaches use available data measurements to detect anomalies, and are oblivious to the quality of the measurements. In order to improve the detection of anomalies, we propose an approach relying on clustering algorithms that detects pattern anomalies while integrating the quality of the sensor into the algorithm. Finally, we have studied the way data quality could be taken into account for analyzing sensor data. We have proposed two preliminary contributions, which are a first step towards quality-aware sensor data analytics: quality-aware aggregation operators, and an approach that evaluates the quality of a given aggregate considering the data used in its computation.
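A simplified sketch of the quality-aware imputation flavour described above: a missing reading is filled with a mean of co-located sensors weighted by a per-sensor quality score. The sensor names, readings and quality scores are assumptions, and the weighting scheme is only one possible instantiation of the idea, not the thesis's actual approach.

```python
import numpy as np
import pandas as pd

# Readings from co-located sensors (NaN = missing) and a quality score per sensor in (0, 1].
readings = pd.DataFrame({
    "s1": [21.0, 20.5, np.nan, 21.2],
    "s2": [20.8, 20.9, 24.0, 21.0],    # s2 drifts at t=2 and has a low quality score
    "s3": [21.1, 20.7, 21.3, 21.1],
})
quality = pd.Series({"s1": 0.9, "s2": 0.3, "s3": 0.8})

def quality_weighted_impute(row: pd.Series, quality: pd.Series) -> pd.Series:
    observed = row.dropna()
    if observed.empty or not row.isna().any():
        return row
    w = quality[observed.index]
    fill = float(np.average(observed, weights=w))   # quality-weighted mean of the available sensors
    return row.fillna(fill)

imputed = readings.apply(quality_weighted_impute, axis=1, quality=quality)
print(imputed.round(2))
```

Compared with a plain mean of the available sensors, the low-quality drifting sensor contributes less to the imputed value, which is the point of making imputation quality-aware.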
APA, Harvard, Vancouver, ISO, and other styles