Dissertations / Theses on the topic 'Data cleaning'


Consult the top 50 dissertations / theses for your research on the topic 'Data cleaning.'


1

Liebchen, Gernot Armin. "Data cleaning techniques for software engineering data sets." Thesis, Brunel University, 2010. http://bura.brunel.ac.uk/handle/2438/5951.

Full text
Abstract:
Data quality is an important issue which has been addressed and recognised in research communities such as data warehousing, data mining and information systems. It has been agreed that poor data quality will impact the quality of results of analyses and that it will therefore impact on decisions made on the basis of these results. Empirical software engineering has neglected the issue of data quality to some extent. This fact poses the question of how researchers in empirical software engineering can trust their results without addressing the quality of the analysed data. One widely accepted definition for data quality describes it as `fitness for purpose', and the issue of poor data quality can be addressed by either introducing preventative measures or by applying means to cope with data quality issues. The research presented in this thesis addresses the latter with the special focus on noise handling. Three noise handling techniques, which utilise decision trees, are proposed for application to software engineering data sets. Each technique represents a noise handling approach: robust filtering, where training and test sets are the same; predictive filtering, where training and test sets are different; and filtering and polish, where noisy instances are corrected. The techniques were first evaluated in two different investigations by applying them to a large real world software engineering data set. In the first investigation the techniques' ability to improve predictive accuracy in differing noise levels was tested. All three techniques improved predictive accuracy in comparison to the do-nothing approach. The filtering and polish was the most successful technique in improving predictive accuracy. The second investigation utilising the large real world software engineering data set tested the techniques' ability to identify instances with implausible values. These instances were flagged for the purpose of evaluation before applying the three techniques. Robust filtering and predictive filtering decreased the number of instances with implausible values, but substantially decreased the size of the data set too. The filtering and polish technique actually increased the number of implausible values, but it did not reduce the size of the data set. Since the data set contained historical software project data, it was not possible to know the real extent of noise detected. This led to the production of simulated software engineering data sets, which were modelled on the real data set used in the previous evaluations to ensure domain specific characteristics. These simulated versions of the data set were then injected with noise, such that the real extent of the noise was known. After the noise injection the three noise handling techniques were applied to allow evaluation. This procedure of simulating software engineering data sets combined the incorporation of domain specific characteristics of the real world with the control over the simulated data. This is seen as a special strength of this evaluation approach. The results of the evaluation of the simulation showed that none of the techniques performed well. Robust filtering and filtering and polish performed very poorly, and based on the results of this evaluation they would not be recommended for the task of noise reduction. The predictive filtering technique was the best performing technique in this evaluation, but it did not perform significantly well either. 
An exhaustive systematic literature review has been carried out investigating to what extent the empirical software engineering community has considered data quality. The findings showed that the issue of data quality has been largely neglected by the empirical software engineering community. The work in this thesis highlights an important gap in empirical software engineering. It provided clarification and distinctions of the terms noise and outliers. Noise and outliers are overlapping, but they are fundamentally different. Since noise and outliers are often treated the same in noise handling techniques, a clarification of the two terms was necessary. To investigate the capabilities of noise handling techniques a single investigation was deemed as insufficient. The reasons for this are that the distinction between noise and outliers is not trivial, and that the investigated noise cleaning techniques are derived from traditional noise handling techniques where noise and outliers are combined. Therefore three investigations were undertaken to assess the effectiveness of the three presented noise handling techniques. Each investigation should be seen as a part of a multi-pronged approach. This thesis also highlights possible shortcomings of current automated noise handling techniques. The poor performance of the three techniques led to the conclusion that noise handling should be integrated into a data cleaning process where the input of domain knowledge and the replicability of the data cleaning process are ensured.
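The three decision-tree-based techniques are only described at a high level in the abstract; the sketch below illustrates the general idea behind predictive filtering (train a tree on one partition, flag instances whose recorded value deviates strongly from the prediction), not the thesis's actual implementation. The column names, threshold, and use of scikit-learn are assumptions for illustration.

```python
# Minimal sketch of predictive filtering: flag instances whose recorded target
# deviates strongly from a decision tree's prediction (column names assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def predictive_filter(df, features, target, rel_threshold=0.5):
    # Train on one random half and test on the other, so training and test sets differ.
    train, test = train_test_split(df, test_size=0.5, random_state=0)
    model = DecisionTreeRegressor(min_samples_leaf=2).fit(train[features], train[target])
    predicted = pd.Series(model.predict(test[features]), index=test.index)
    relative_error = (test[target] - predicted).abs() / predicted.clip(lower=1e-9)
    noisy = test[relative_error > rel_threshold]    # candidate noisy instances
    clean = test[relative_error <= rel_threshold]   # "filtering and polish" would instead correct them
    return clean, noisy

# Hypothetical software engineering data set with an implausible effort value.
df = pd.DataFrame({
    "size_kloc": [10, 12, 11, 50, 9, 48, 52, 13],
    "team":      [3, 3, 4, 10, 3, 9, 11, 4],
    "effort":    [30, 33, 31, 200, 900, 190, 210, 35],
})
clean, noisy = predictive_filter(df, ["size_kloc", "team"], "effort")
```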
2

Li, Lin. "Data quality and data cleaning in database applications." Thesis, Edinburgh Napier University, 2012. http://researchrepository.napier.ac.uk/Output/5788.

Full text
Abstract:
Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many possible reasons for such failures, such as poor system infrastructure design or poor query performance, but nothing is more certain to yield failure than a lack of concern for the issue of data quality. High-quality data is a key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial one. It has now been recognized that an inordinate proportion of data in most data sources is dirty. Obviously, a database application with a high proportion of dirty data is not reliable for the purpose of data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also unreliable. In order to ensure high data quality, enterprises need processes, methodologies and resources to monitor and analyze the quality of data, and methodologies for preventing and/or detecting and repairing dirty data. This thesis focuses on the improvement of data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to the improvement of the quality of data, and addresses a number of research issues related to data cleaning. In the first part of the thesis, the literature on data cleaning and data quality is reviewed and discussed. Building on this research, a rule-based taxonomy of dirty data is proposed in the second part of the thesis. The proposed taxonomy not only summarizes the most common dirty data types but also forms the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This helps us to design the DDS process in the proposed data cleaning framework described in the third part of the thesis. This framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process. Finally, a set of approximate string matching algorithms is studied and experimental work has been undertaken. Approximate string matching is an important part of many data cleaning approaches and has been well studied for many years. The experimental work in the thesis confirmed that there is no clear best technique. It shows that the characteristics of the data, such as the size of a dataset, the error rate in a dataset, the type of strings in a dataset and even the type of typo in a string, have a significant effect on the performance of the selected techniques. In addition, the characteristics of the data also affect the selection of suitable threshold values for the selected matching algorithms. The findings from these experimental results provide the basis for the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of data cleaning systems in database applications.
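Because the conclusion hinges on how differently string-matching techniques behave on different data, a minimal sketch of two common measures may help; the functions and example pairs below are illustrative assumptions, not the algorithms evaluated in the thesis.

```python
# Sketch: two common approximate string matching measures. Which one works best,
# and what threshold to use, depends on the data characteristics, as the thesis notes.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def jaccard_bigrams(a: str, b: str) -> float:
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

pairs = [("Jon Smith", "John Smith"), ("IBM Corp.", "I.B.M. Corporation")]
for a, b in pairs:
    edit_sim = 1 - levenshtein(a, b) / max(len(a), len(b))   # character-based
    print(a, "|", b, "edit:", round(edit_sim, 2), "jaccard:", round(jaccard_bigrams(a, b), 2))
```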
3

Iyer, Vasanth. "Ensemble Stream Model for Data-Cleaning in Sensor Networks." FIU Digital Commons, 2013. http://digitalcommons.fiu.edu/etd/973.

Full text
Abstract:
Ensemble stream modeling and data cleaning are sensor information processing systems that have different training and testing methods by which their goals are cross-validated. This research examines a mechanism which seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events so as to eliminate uncorrelated noise and choose the most likely model without overfitting, thus obtaining higher model confidence. Higher-quality streams can be realized by combining many short streams into an ensemble which has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction for events such as a bush or natural forest fire, we take the burnt area (BA*), the sensed ground truth obtained from logs, as our target variable. Even though this is an obvious model choice, the results are disappointing, for two reasons: first, the histogram of fire activity is highly skewed; second, the measured sensor parameters are highly correlated. Since using non-descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from the sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize training error. We use the F-measure score, which combines precision and recall, to determine the false alarm rate of fire events. The multi-target data-cleaning trees use the information purity of the target leaf nodes to learn higher-order features. A sensitive variance measure such as the F-test is performed during each node's split to select the best attribute. The ensemble stream model approach proved to improve when complicated features were used with a simpler tree classifier. The ensemble framework for data cleaning and the enhancements to quantify the quality of fitness of sensors (30% spatial, 10% temporal, and 90% mobility reduction) led to the formation of streams for sensor-enabled applications, which further motivates the novelty of stream quality labeling and its importance in handling the vast amounts of real-time mobile streams generated today.
4

Jia, Xibei. "From relations to XML : cleaning, integrating and securing data." Thesis, University of Edinburgh, 2008. http://hdl.handle.net/1842/3161.

Full text
Abstract:
While relational databases are still the preferred approach for storing data, XML is emerging as the primary standard for representing and exchanging data. Consequently, it has been increasingly important to provide a uniform XML interface to various data sources— integration; and critical to protect sensitive and confidential information in XML data — access control. Moreover, it is preferable to first detect and repair the inconsistencies in the data to avoid the propagation of errors to other data processing steps. In response to these challenges, this thesis presents an integrated framework for cleaning, integrating and securing data. The framework contains three parts. First, the data cleaning sub-framework makes use of a new class of constraints specially designed for improving data quality, referred to as conditional functional dependencies (CFDs), to detect and remove inconsistencies in relational data. Both batch and incremental techniques are developed for detecting CFD violations by SQL efficiently and repairing them based on a cost model. The cleaned relational data, together with other non-XML data, is then converted to XML format by using widely deployed XML publishing facilities. Second, the data integration sub-framework uses a novel formalism, XML integration grammars (XIGs), to integrate multi-source XML data which is either native or published from traditional databases. XIGs automatically support conformance to a target DTD, and allow one to build a large, complex integration via composition of component XIGs. To efficiently materialize the integrated data, algorithms are developed for merging XML queries in XIGs and for scheduling them. Third, to protect sensitive information in the integrated XML data, the data security sub-framework allows users to access the data only through authorized views. User queries posed on these views need to be rewritten into equivalent queries on the underlying document to avoid the prohibitive cost of materializing and maintaining large number of views. Two algorithms are proposed to support virtual XML views: a rewriting algorithm that characterizes the rewritten queries as a new form of automata and an evaluation algorithm to execute the automata-represented queries. They allow the security sub-framework to answer queries on views in linear time. Using both relational and XML technologies, this framework provides a uniform approach to clean, integrate and secure data. The algorithms and techniques in the framework have been implemented and the experimental study verifies their effectiveness and efficiency.
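The abstract relies on conditional functional dependencies (CFDs) for detecting inconsistencies; the thesis detects violations with SQL, but the hedged sketch below shows the same idea in Python/pandas with an invented example CFD (for UK records, the zip code determines the city).

```python
# Minimal sketch of conditional functional dependency (CFD) checking in pandas.
# Example CFD (invented): within records where country = 'UK', zip -> city.
import pandas as pd

df = pd.DataFrame({
    "country": ["UK", "UK", "UK", "NL"],
    "zip":     ["EH8", "EH8", "EH8", "1011"],
    "city":    ["Edinburgh", "Edinburgh", "London", "Amsterdam"],
})

def cfd_violations(df, condition, lhs, rhs):
    """Return rows violating 'within `condition`, `lhs` functionally determines `rhs`'."""
    scope = df
    for col, val in condition.items():               # apply the constant pattern
        scope = scope[scope[col] == val]
    bad_keys = (scope.groupby(lhs)[rhs].nunique()     # LHS values mapping to >1 RHS value
                      .loc[lambda s: s > 1].index)
    return scope[scope[lhs].isin(bad_keys)]

print(cfd_violations(df, condition={"country": "UK"}, lhs="zip", rhs="city"))
```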
5

Kokkonen, H. (Henna). "Effects of data cleaning on machine learning model performance." Bachelor's thesis, University of Oulu, 2019. http://jultika.oulu.fi/Record/nbnfioulu-201911133081.

Full text
Abstract:
This thesis focuses on the preprocessing and challenges of a university student data set and on how different levels of data preprocessing affect the performance of a prediction model, both in general and in selected groups of interest. The data set comprises the students at the University of Oulu who were admitted to the Faculty of Information Technology and Electrical Engineering during the years 2006–2015. This data set was cleaned at three different levels, which resulted in three differently processed data sets: the first is the original data set with only basic cleaning, the second has been cleaned of the most obvious anomalies, and the third has been systematically cleaned of possible anomalies. Each of these data sets was used to build a Gradient Boosting Machine model that predicted the cumulative number of ECTS credits the students would achieve by the end of their second-year studies, based on their first-year studies and their Matriculation Examination results. The effects of the cleaning on model performance were examined by comparing the prediction accuracy and the information the models gave about the factors that might indicate a slow ECTS accumulation. The results showed that the prediction accuracy improved after each cleaning stage and that the influences of the features altered significantly, becoming more reasonable.
6

Bischof, Stefan, Benedikt Kämpgen, Andreas Harth, Axel Polleres, and Patrik Schneider. "Open City Data Pipeline." Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business, 2017. http://epub.wu.ac.at/5438/1/city%2Dqb%2Dpaper.pdf.

Full text
Abstract:
Statistical data about cities, regions and countries is collected for various purposes and from various institutions. Yet, while access to high-quality and recent such data is crucial both for decision makers and for the public, all too often such collections of data remain isolated and not re-usable, let alone properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and republish this data in a reusable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular and extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques and ontological reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV, and linking to e.g. DBpedia. Lastly, in an exhaustive evaluation of our approach, we compare our enrichment and cleansing techniques to a preliminary version of the Open City Data Pipeline presented at ISWC2015: firstly, we demonstrate that the combination of equational knowledge and standard machine learning techniques significantly helps to improve the quality of our missing value imputations; secondly, we arguably show that the more data we integrate, the more reliable our predictions become. Hence, over time, the Open City Data Pipeline shall provide a sustainable effort to serve Linked Data about cities in increasing quality.
Series: Working Papers on Information Systems, Information Business and Operations
7

Pumpichet, Sitthapon. "Novel Online Data Cleaning Protocols for Data Streams in Trajectory, Wireless Sensor Networks." FIU Digital Commons, 2013. http://digitalcommons.fiu.edu/etd/1004.

Full text
Abstract:
The promise of Wireless Sensor Networks (WSNs) is the autonomous collaboration of a collection of sensors to accomplish some specific goals which a single sensor cannot offer. Basically, sensor networking serves a range of applications by providing the raw data as fundamentals for further analyses and actions. The imprecision of the collected data could tremendously mislead the decision-making process of sensor-based applications, resulting in an ineffectiveness or failure of the application objectives. Due to inherent WSN characteristics normally spoiling the raw sensor readings, many research efforts attempt to improve the accuracy of the corrupted or “dirty” sensor data. The dirty data need to be cleaned or corrected. However, the developed data cleaning solutions restrict themselves to the scope of static WSNs where deployed sensors would rarely move during the operation. Nowadays, many emerging applications relying on WSNs need the sensor mobility to enhance the application efficiency and usage flexibility. The location of deployed sensors needs to be dynamic. Also, each sensor would independently function and contribute its resources. Sensors equipped with vehicles for monitoring the traffic condition could be depicted as one of the prospective examples. The sensor mobility causes a transient in network topology and correlation among sensor streams. Based on static relationships among sensors, the existing methods for cleaning sensor data in static WSNs are invalid in such mobile scenarios. Therefore, a solution of data cleaning that considers the sensor movements is actively needed. This dissertation aims to improve the quality of sensor data by considering the consequences of various trajectory relationships of autonomous mobile sensors in the system. First of all, we address the dynamic network topology due to sensor mobility. The concept of virtual sensor is presented and used for spatio-temporal selection of neighboring sensors to help in cleaning sensor data streams. This method is one of the first methods to clean data in mobile sensor environments. We also study the mobility pattern of moving sensors relative to boundaries of sub-areas of interest. We developed a belief-based analysis to determine the reliable sets of neighboring sensors to improve the cleaning performance, especially when node density is relatively low. Finally, we design a novel sketch-based technique to clean data from internal sensors where spatio-temporal relationships among sensors cannot lead to the data correlations among sensor streams.
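As a rough illustration of the neighbour-based cleaning idea (not the dissertation's virtual-sensor algorithm), the sketch below replaces a reading that deviates from a spatio-temporally weighted estimate computed from nearby, recent readings; all thresholds and the record layout are invented.

```python
# Generic sketch: clean a mobile sensor reading by comparing it with spatio-temporally
# nearby readings and replacing it if it deviates too far from their weighted estimate.
import math

def neighbor_estimate(reading, others, max_dist=50.0, max_dt=10.0):
    """reading/others: dicts with x, y, t, value. Distance- and recency-weighted mean."""
    weighted, total = [], 0.0
    for o in others:
        dist = math.hypot(o["x"] - reading["x"], o["y"] - reading["y"])
        dt = abs(o["t"] - reading["t"])
        if dist <= max_dist and dt <= max_dt:
            w = 1.0 / (1.0 + dist + dt)          # closer in space/time -> higher weight
            weighted.append((w, o["value"]))
            total += w
    if not weighted:
        return reading["value"]                   # no usable neighbours: keep as-is
    return sum(w * v for w, v in weighted) / total

def clean(reading, others, tolerance=5.0):
    est = neighbor_estimate(reading, others)
    return est if abs(reading["value"] - est) > tolerance else reading["value"]

print(clean({"x": 0, "y": 0, "t": 100, "value": 93.0},
            [{"x": 5, "y": 0, "t": 99, "value": 21.0},
             {"x": 10, "y": 3, "t": 98, "value": 22.5}]))
```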
8

Artilheiro, Fernando Manuel Freitas. "Analysis and procedures of multibeam data cleaning for bathymetric charting." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1996. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp04/mq23776.pdf.

Full text
9

Ramakrishnan, Ranjani. "A data cleaning and annotation framework for genome-wide studies." Full text open access at:, 2007. http://content.ohsu.edu/u?/etd,263.

Full text
10

Hallström, Fredrik, and David Adolfsson. "Data Cleaning Extension on IoT Gateway : An Extended ThingsBoard Gateway." Thesis, Karlstads universitet, Institutionen för matematik och datavetenskap (from 2013), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-84376.

Full text
Abstract:
Machine learning algorithms that run on Internet of Things sensory data require high data quality to produce relevant output. By providing data cleaning at the edge, cloud infrastructures performing AI computations are relieved of having to perform preprocessing. The main problem connected with edge cleaning is the dependency on unsupervised preprocessing, as it leaves no guarantee of high-quality output data. In this thesis an IoT gateway is extended to provide cleaning and live configuration of cleaning parameters before forwarding the data to a server cluster. Live configuration is implemented so that the parameters can be fitted to match a time series and thereby mitigate quality issues. The gateway framework's performance and the container's resource usage were benchmarked using an MQTT stress tester. The gateway's performance was below expectation: with high-frequency data streams, the throughput was below 50%. However, these issues are not present for its Glava Energy Center connector, as that sensory data is generated at a slower pace.
AI4ENERGY
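The thesis extends the ThingsBoard gateway; the stand-alone sketch below only illustrates the general pattern of an edge cleaning step with live-configurable parameters (range check plus moving average) applied before forwarding, and is not based on the ThingsBoard API.

```python
# Hypothetical, framework-independent illustration of edge cleaning with
# live-configurable parameters applied before telemetry is forwarded.
from collections import deque

class CleaningFilter:
    def __init__(self, low=-40.0, high=85.0, window=5):
        self.config = {"low": low, "high": high, "window": window}
        self.buffer = deque(maxlen=window)

    def reconfigure(self, **params):              # "live configuration" of cleaning parameters
        self.config.update(params)
        self.buffer = deque(self.buffer, maxlen=self.config["window"])

    def process(self, value):
        if not (self.config["low"] <= value <= self.config["high"]):
            return None                            # drop out-of-range readings
        self.buffer.append(value)
        return sum(self.buffer) / len(self.buffer)  # smoothed value to forward

f = CleaningFilter()
for v in [20.1, 20.3, 500.0, 20.2]:                # 500.0 is an obvious outlier
    print(f.process(v))
f.reconfigure(window=10)                           # adjust parameters at runtime
```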
11

Bischof, Stefan, Andreas Harth, Benedikt Kämpgen, Axel Polleres, and Patrik Schneider. "Enriching integrated statistical open city data by combining equational knowledge and missing value imputation." Elsevier, 2017. http://dx.doi.org/10.1016/j.websem.2017.09.003.

Full text
Abstract:
Several institutions collect statistical data about cities, regions, and countries for various purposes. Yet, while access to high quality and recent such data is both crucial for decision makers and a means for achieving transparency to the public, all too often such collections of data remain isolated and not re-useable, let alone comparable or properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and re-publish the resulting dataset in a re-useable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular and extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques and reasoning over equational background knowledge to enrich the data by imputing missing values, (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data, including links to external data sources, such as DBpedia, available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV. Apart from providing a contribution to the growing collection of data available as Linked Data, our enrichment process for missing values also contributes a novel methodology for combining rule-based inference about equational knowledge with inferences obtained from statistical Machine Learning approaches. While most existing works about inference in Linked Data have focused on ontological reasoning in RDFS and OWL, we believe that these complementary methods and particularly their combination could be fruitfully applied also in many other domains for integrating Statistical Linked Data, independent from our concrete use case of integrating city data.
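The combination of equational knowledge and learned imputation can be pictured with a small sketch; the indicator relation (density = population / area) and the fallback regression below are assumed examples, not the paper's actual equations or models.

```python
# Sketch of the general idea: first impute a missing indicator from equational
# knowledge (density = population / area_km2, an assumed relation), then fall
# back to a learned model for rows the equation cannot fill.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "population": [1.9e6, 8.9e6, 3.6e6, 1.8e6],
    "area_km2":   [415.0, 1572.0, 891.0, None],
    "density":    [None,  5660.0, 4040.0, None],
})

# 1) Equational imputation where both operands are known.
eq = df["density"].isna() & df["population"].notna() & df["area_km2"].notna()
df.loc[eq, "density"] = df.loc[eq, "population"] / df.loc[eq, "area_km2"]

# 2) Machine learning fallback for whatever is still missing, trained on complete rows.
known, missing = df[df["density"].notna()], df[df["density"].isna()]
if not missing.empty and len(known) >= 2:
    model = LinearRegression().fit(known[["population"]], known["density"])
    df.loc[df["density"].isna(), "density"] = model.predict(missing[["population"]])
print(df)
```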
12

Bakhtiar, Qutub A. "Mitigating Inconsistencies by Coupling Data Cleaning, Filtering, and Contextual Data Validation in Wireless Sensor Networks." FIU Digital Commons, 2009. http://digitalcommons.fiu.edu/etd/99.

Full text
Abstract:
With the advent of peer to peer networks, and more importantly sensor networks, the desire to extract useful information from continuous and unbounded streams of data has become more prominent. For example, in tele-health applications, sensor based data streaming systems are used to continuously and accurately monitor Alzheimer's patients and their surrounding environment. Typically, the requirements of such applications necessitate the cleaning and filtering of continuous, corrupted and incomplete data streams gathered wirelessly in dynamically varying conditions. Yet, existing data stream cleaning and filtering schemes are incapable of capturing the dynamics of the environment while simultaneously suppressing the losses and corruption introduced by uncertain environmental, hardware, and network conditions. Consequently, existing data cleaning and filtering paradigms are being challenged. This dissertation develops novel schemes for cleaning data streams received from a wireless sensor network operating under non-linear and dynamically varying conditions. The study establishes a paradigm for validating spatio-temporal associations among data sources to enhance data cleaning. To simplify the complexity of the validation process, the developed solution maps the requirements of the application on a geometrical space and identifies the potential sensor nodes of interest. Additionally, this dissertation models a wireless sensor network data reduction system by ascertaining that segregating data adaptation and prediction processes will augment the data reduction rates. The schemes presented in this study are evaluated using simulation and information theory concepts. The results demonstrate that dynamic conditions of the environment are better managed when validation is used for data cleaning. They also show that when a fast convergent adaptation process is deployed, data reduction rates are significantly improved. Targeted applications of the developed methodology include machine health monitoring, tele-health, environment and habitat monitoring, intermodal transportation and homeland security.
13

Lew, Alexander K. "PClean : Bayesian data cleaning at scale with domain-specific probabilistic programming." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/130607.

Full text
Abstract:
Data cleaning is naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered, corrupted, and joined to yield incomplete, dirty, and denormalized datasets. Based on this view, this thesis presents PClean, a unified generative modeling architecture for cleaning and normalizing dirty data in diverse domains. Given an unclean dataset and a probabilistic program encoding relevant domain knowledge, PClean learns a structured representation of the data as a relational database of interrelated objects, and uses this latent structure to impute missing values, identify duplicates, detect errors, and propose corrections in the original data table. PClean makes three modeling and inference contributions: (i) a domain-general non-parametric generative model of relational data, for inferring latent objects and their network of latent connections; (ii) a domain-specific probabilistic programming language, for encoding domain knowledge specific to each dataset being cleaned; and (iii) a domain-general inference engine that adapts to each PClean program by constructing data-driven proposals used in sequential Monte Carlo and particle Gibbs. This thesis shows empirically that short (< 50-line) PClean programs deliver higher accuracy than state-of-the-art data cleaning systems based on machine learning and weighted logic; that PClean's inference algorithm is faster than generic particle Gibbs inference for probabilistic programs; and that PClean scales to large real-world datasets with millions of rows.
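PClean programs are written in their own domain-specific probabilistic programming language, which is not reproduced here; the toy sketch below only illustrates the generative view the abstract describes (a prior over ground truth combined with a noisy-channel likelihood, yielding a posterior over corrections), using invented values and plain Python.

```python
# Toy illustration of "cleaning as Bayesian inference": posterior over the true value
# given a dirty observation = prior over ground truth x noisy-channel likelihood.
# This is NOT PClean syntax, just the underlying generative view.
from difflib import SequenceMatcher

prior = {"Cambridge": 0.6, "Camberwell": 0.1, "Boston": 0.3}   # assumed ground-truth frequencies

def likelihood(observed, truth, typo_rate=0.1):
    # Crude noisy channel: the less similar the strings, the less likely the observation.
    dissimilarity = 1.0 - SequenceMatcher(None, observed.lower(), truth.lower()).ratio()
    return typo_rate ** (10.0 * dissimilarity)

def posterior(observed):
    scores = {t: prior[t] * likelihood(observed, t) for t in prior}
    z = sum(scores.values())
    return {t: round(s / z, 3) for t, s in scores.items()}

print(posterior("Cambrige"))   # most of the mass lands on "Cambridge" -> proposed correction
```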
14

Carreira, Paulo J. F. "Mapper: An Efficient Data Transformation Operator." Doctoral thesis, Department of Informatics, University of Lisbon, 2008. http://hdl.handle.net/10451/14295.

Full text
Abstract:
Data transformations are fundamental operations in legacy data migration, data integration, data cleaning, and data warehousing. These operations are often implemented as relational queries that aim at leveraging the optimization capabilities of most DBMSs. However, relational query languages like SQL are not expressive enough to specify one-to-many data transformations, an important class of data transformations that produce several output tuples for a single input tuple. These transformations are required for solving several types of data heterogeneities, like those that occur when the source data represents aggregations of the target data. This thesis proposes a new relational operator, named data mapper, as an extension to the relational algebra to address one-to-many data transformations, and focuses on its optimization. It also provides algebraic rewriting rules and execution algorithms for the logical and physical optimization, respectively. As a result, queries may be expressed as a combination of standard relational operators and mappers. The proposed optimizations have been experimentally validated and the key factors that influence the obtained performance gains identified.
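A one-to-many transformation is easiest to see with a tiny example; the sketch below shows the flavour of a mapper that emits several output tuples per input tuple (here, unfolding a hypothetical yearly total into monthly records), and is not the operator implementation from the thesis.

```python
# Sketch of a one-to-many "mapper" transformation: each input tuple may yield
# several output tuples (invented example: unfold a yearly total into months).
def mapper(relation, fn):
    """Apply fn to every tuple; fn returns an iterable of output tuples."""
    for row in relation:
        yield from fn(row)

def unfold_year(row):
    emp, year, yearly_total = row
    monthly = round(yearly_total / 12, 2)
    return [(emp, year, month, monthly) for month in range(1, 13)]

source = [("alice", 2007, 60000.0), ("bob", 2007, 48000.0)]
target = list(mapper(source, unfold_year))
print(len(target), target[:2])    # 24 output tuples produced from 2 input tuples
```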
15

Alkharboush, Nawaf Abdullah H. "A data mining approach to improve the automated quality of data." Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/65641/1/Nawaf%20Abdullah%20H_Alkharboush_Thesis.pdf.

Full text
Abstract:
This thesis describes the development of a robust and novel prototype to address the data quality problems that relate to the dimension of outlier data. It thoroughly investigates the associated problems with regards to detecting, assessing and determining the severity of the problem of outlier data; and proposes granule-mining based alternative techniques to significantly improve the effectiveness of mining and assessing outlier data.
16

Bourennani, Farid. "Integration of heterogeneous data types using self organizing maps." Thesis, UOIT, 2009. http://hdl.handle.net/10155/41.

Full text
Abstract:
With the growth of computer networks and the advancement of hardware technologies, unprecedented volumes of data have become accessible in a distributed fashion, forming heterogeneous data sources. Understanding and combining these data into data warehouses, or merging remote public data into existing databases, can significantly enrich the information provided by these data. This problem is called data integration: combining data residing at different sources, and providing the user with a unified view of these data. There are two issues with making use of remote data sources: (1) discovery of relevant data sources, and (2) performing the proper joins between the local data source and the relevant remote databases. Both can be solved if one can effectively identify semantically related attributes between the local data sources and the available remote data sources. However, performing these tasks manually is time-consuming because of the large data sizes and the unavailability of schema documentation; therefore, an automated tool would definitely be more suitable. Automatically detecting similar entities based on content is challenging due to three factors. First, because the number of records is voluminous, it is difficult to perceive or discover information structures or relationships. Second, the schemas of the databases are unfamiliar; therefore, detecting relevant data is difficult. Third, the database entity types are heterogeneous and there is no existing solution for extracting a richer classification result from the processing of two different data types, or at least from textual and numerical data. We propose to utilize self-organizing maps (SOM) to aid the visual exploration of the large data volumes. The unsupervised classification property of SOM facilitates the integration of completely unfamiliar relational database tables and attributes based on their contents. In order to accommodate the heterogeneous data types found in relational databases, we extended the term frequency–inverse document frequency (TF-IDF) measure to handle numerical and textual attribute types by unified vectorization processing. The resulting map allows the user to browse the heterogeneously typed database attributes and discover clusters of documents (attributes) having similar content. The discovered clusters can significantly aid in manual or automated construction of data integrity constraints in data cleaning or schema mappings for data integration.
17

GAGLIARDELLI, LUCA. "Tecniche per l’Integrazione di Sorgenti Big Data in Ambienti di Calcolo Distribuito." Doctoral thesis, Università degli studi di Modena e Reggio Emilia, 2020. http://hdl.handle.net/11380/1200610.

Full text
Abstract:
Data sources that provide huge amounts of semi-structured data are available on the Web as tables, annotated content (e.g., RDF) and Linked Open Data. These sources can constitute a valuable source of information for companies, researchers and government agencies, if properly manipulated and integrated with each other or with proprietary data. One of the main problems is that these sources are typically heterogeneous and do not come with keys on which to perform join operations and effortlessly link their records. Thus, finding a way to join data sources without keys is a fundamental and critical part of data integration. Moreover, for many applications the execution time is a critical component (e.g., in finance or national security contexts) and distributed computing can be employed to significantly reduce it. In this dissertation, I present distributed data integration techniques that allow scaling to large volumes of data (i.e., Big Data), in particular SparkER and GraphJoin. SparkER is an Entity Resolution tool that aims to exploit distributed computing to identify records in data sources that refer to the same real-world entity, thus enabling the integration of those records. This tool introduces a novel algorithm to parallelize the indexing techniques that are currently the state of the art. SparkER is a working software prototype that I developed and employed to perform experiments on real data sets; the results show that the parallelization techniques I have developed are more efficient in terms of execution time and memory usage than those in the literature. GraphJoin is a novel technique that allows finding similar records by applying joining rules on one or more attributes. This technique combines similarity join techniques designed to work on a single rule, optimizing their execution with multiple joining rules and combining different similarity measures, both token- and character-based (e.g., Jaccard similarity and edit distance). For GraphJoin I developed a working software prototype and employed it to demonstrate experimentally that the proposed technique is effective and outperforms the existing ones in terms of execution time.
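To make the indexing step concrete, the sketch below shows plain single-machine token blocking, the kind of step SparkER parallelizes: records sharing a token land in the same block, and only within-block pairs are compared. The records and the block-size cutoff are invented.

```python
# Single-machine sketch of token blocking, the indexing step that SparkER
# parallelizes: only records that share at least one token become candidate pairs.
from collections import defaultdict
from itertools import combinations

records = {
    1: "Apple Inc. Cupertino",
    2: "Apple Incorporated, Cupertino CA",
    3: "Banana Republic San Francisco",
}

blocks = defaultdict(set)
for rid, text in records.items():
    for token in set(text.lower().replace(",", " ").replace(".", " ").split()):
        blocks[token].add(rid)

candidate_pairs = set()
for rids in blocks.values():
    if 1 < len(rids) < 10:                        # drop huge, uninformative blocks
        candidate_pairs.update(combinations(sorted(rids), 2))

print(candidate_pairs)                            # {(1, 2)} -- far fewer than all pairs
```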
18

Jardini, Toni [UNESP]. "Ambiente data cleaning: suporte extensível, semântico e automático para análise e transformação de dados." Universidade Estadual Paulista (UNESP), 2012. http://hdl.handle.net/11449/98702.

Full text
Abstract:
One of the great challenges and difficulties in obtaining knowledge from data sources is to ensure consistency and non-duplication of the stored data. Many techniques and algorithms have been proposed to minimize the hard work of allowing data to be analyzed and corrected. However, there are still other aspects essential to the success of the data cleaning process, involving several technological areas: computational performance, semantics and process autonomy. Against this backdrop, a data cleaning environment has been developed that comprises a collection of tools for automatic data analysis and transformation, which is extensible and language-independent, with semantic and learning support. The objective of this work is to propose an environment whose contributions cover problems still little explored by the data cleaning scientific community, such as semantics and autonomy in the execution of the cleaning, and whose aims include reducing user interaction in the process of analysing and correcting inconsistencies and duplicates. Among the contributions of the developed environment, its efficacy proved significant, covering approximately 90% of the inconsistencies present in the database, with 0% false positives and no need for user interaction.
19

Jardini, Toni. "Ambiente data cleaning : suporte extensível, semântico e automático para análise e transformação de dados /." São José do Rio Preto : [s.n.], 2012. http://hdl.handle.net/11449/98702.

Full text
Abstract:
Advisor: Carlos Roberto Valêncio
Committee: Nalvo Franco de Almeida Junior
Committee: José Márcio Machado
One of the great challenges and difficulties in obtaining knowledge from data sources is to ensure consistency and non-duplication of the stored data. Many techniques and algorithms have been proposed to minimize the hard work of allowing data to be analyzed and corrected. However, there are still other aspects essential to the success of the data cleaning process, involving several technological areas: computational performance, semantics and process autonomy. Against this backdrop, a data cleaning environment has been developed that comprises a collection of tools for automatic data analysis and transformation, which is extensible and language-independent, with semantic and learning support. The objective of this work is to propose an environment whose contributions cover problems still little explored by the data cleaning scientific community, such as semantics and autonomy in the execution of the cleaning, and whose aims include reducing user interaction in the process of analysing and correcting inconsistencies and duplicates. Among the contributions of the developed environment, its efficacy proved significant, covering approximately 90% of the inconsistencies present in the database, with 0% false positives and no need for user interaction.
Master's degree
20

Neelisetty, Srikanth. "Detector Diagnostics, Data Cleaning and Improved Single Loop Velocity Estimation from Conventional Loop Detectors." The Ohio State University, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=osu1419350524.

Full text
21

Zelený, Pavel. "Řízení kvality dat v malých a středních firmách." Master's thesis, Vysoká škola ekonomická v Praze, 2010. http://www.nusl.cz/ntk/nusl-82036.

Full text
Abstract:
This diploma thesis deals with data quality management. There are many tools and methodologies to support data quality management, even on the Czech market, but they are all intended for large companies; small and medium-sized companies cannot afford them because of the high cost. The first goal of this thesis is to summarize the principles of these methodologies and, on their basis, to suggest a simpler methodology for small and medium-sized companies. In the second part of the thesis, the methodology is created and adapted for a specific company. The first step is to choose the data area of interest in the company. Because buying a software tool to clean the data is not an option, relatively simple rules are defined that serve as the basis for cleaning scripts written in SQL. The scripts are used for automatic data cleaning. A further analysis then decides which data should be cleaned manually. The next step describes recommendations for removing duplicates from the database, using functionality of the company's production system. The last step of the methodology is to create a control mechanism that keeps the required data quality in the future. At the end of the thesis, a data survey is carried out on four data sources, all from companies using the same production system. The purpose of the survey is to present an overview of data quality and to help these companies decide about cleaning their data as well.
22

Cenonfolo, Filippo. "Signal cleaning techniques and anomaly detection algorithms for motorbike applications." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
This paper outlines the results of the curricular internship project at the Research and Development section of Ducati Motor Holding S.p.A. in collaboration with the Motorvehicle University of Emilia-Romagna (MUNER). The focus is the development of a diagnostic plugin specifically tailored for motorcycle applications, with the aim of automatically detecting anomalous behaviour of the signals recorded from the sensors mounted on board. Acquisitions are performed whenever motorbikes are tested, and they contain a variable number of channels related to the different parameters engineers decide to store for after-run analysis. Dealing with this complexity can be hard on its own, but the correct interpretation of the data becomes even more demanding whenever signals are corrupted or affected by a relevant degree of noise. For this reason, the whole internship project is centered on research into signal cleaning techniques and anomaly detection algorithms, aiming to develop an automatic diagnostic tool. The final goal is to implement a preliminary processing step on the acquisition that assesses the quality of the recorded signals and, if possible, applies strategies that reduce the impact of the anomalies on the overall dataset.
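The thesis's algorithms are not spelled out in the abstract; as a generic illustration of the kind of detector such a diagnostic plugin might start from, the sketch below flags samples whose rolling z-score exceeds a threshold, with an invented channel and parameters.

```python
# Generic rolling z-score anomaly detector for a recorded channel
# (window, threshold and the synthetic rpm signal are invented for illustration).
import numpy as np

def rolling_zscore_anomalies(signal, window=50, threshold=4.0):
    signal = np.asarray(signal, dtype=float)
    flags = np.zeros(len(signal), dtype=bool)
    for i in range(window, len(signal)):
        ref = signal[i - window:i]                # trailing reference window
        std = ref.std()
        if std > 0 and abs(signal[i] - ref.mean()) / std > threshold:
            flags[i] = True
    return flags

rpm = np.concatenate([np.random.normal(6000, 50, 200), [12000], np.random.normal(6000, 50, 50)])
print(np.where(rolling_zscore_anomalies(rpm))[0])  # index of the corrupted sample (200)
```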
23

Feng, Yuan. "Improve Data Quality By Using Dependencies And Regular Expressions." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-35620.

Full text
Abstract:
The objective of this study has been to find ways to improve the quality of a database. There are many problems with the data stored in databases, such as missing values or spelling errors. To deal with dirty data, this study adopts conditional functional dependencies and regular expressions to detect and correct data. Building on former studies of data cleaning methods, this study considers more complex database conditions and combines efficient algorithms to deal with the data. The study shows that by using these methods the quality of a database can be improved; considering time and space complexity, however, there is still much to be done to make the data cleaning process more efficient.
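The regular-expression side of the approach can be sketched briefly; the pattern, the dirty format it corrects, and the column it stands in for are invented examples rather than the thesis's rule set.

```python
# Minimal sketch of regex-based detection and correction of dirty values
# (pattern and formats are invented examples, not the thesis's rules).
import re

DATE_OK = re.compile(r"^\d{4}-\d{2}-\d{2}$")             # expected form: 2018-05-31
DATE_ALT = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")  # common dirty form: 31/5/2018

def clean_date(value):
    if DATE_OK.match(value):
        return value, True
    m = DATE_ALT.match(value)
    if m:
        day, month, year = m.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}", True   # corrected
    return value, False                                          # flagged as dirty

for v in ["2018-05-31", "31/5/2018", "yesterday"]:
    print(v, "->", clean_date(v))
```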
24

Belen, Rahime. "Detecting Disguised Missing Data." Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/12610411/index.pdf.

Full text
Abstract:
In some applications, explicit codes such as NA (not available) are provided for missing data; however, many applications do not provide such explicit codes, and valid or invalid data codes are recorded as legitimate data values. Such missing values are known as disguised missing data. Disguised missing data may negatively affect the quality of data analysis; for example, the results of discovered association rules in the KDD-Cup-98 data sets have clearly shown the need for applying data quality management prior to analysis. In this thesis, to tackle the problem of disguised missing data, we analyzed the embedded unbiased sample heuristic (EUSH), demonstrated the method's drawbacks and proposed a new methodology based on the Chi Square Two Sample Test. The proposed method does not require any domain background knowledge and compares favorably with EUSH.
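One plausible way a chi-square two-sample test can be used under the embedded-unbiased-sample idea is sketched below: compare the distribution of another attribute between records carrying a suspected default value and the remaining records. The thesis's exact procedure may differ, and the data, columns and suspected value are invented.

```python
# Hedged sketch: probe whether records with a suspected default (age == 0) look like
# an unbiased sample of the rest with respect to another attribute.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "age":    [0, 0, 0, 0, 34, 27, 45, 52, 38, 29, 61, 41],
    "region": ["N", "S", "E", "W", "N", "S", "E", "W", "N", "S", "E", "W"],
})

# Contingency table: 'region' counts among suspected-default rows vs the rest.
table = pd.crosstab(df["age"] == 0, df["region"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.2f}")
# A high p-value means the age==0 group is indistinguishable from the rest on 'region',
# which is what one would expect if 0 is a disguise for "missing" rather than a real age.
```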
25

Pipanmaekaporn, Luepol. "A data mining framework for relevance feature discovery." Thesis, Queensland University of Technology, 2013. https://eprints.qut.edu.au/62857/1/Luepol_Pipanmaekaporn_Thesis.pdf.

Full text
Abstract:
This thesis is a study of the automatic discovery of text features for describing user information needs. It presents an innovative data-mining approach that discovers useful knowledge from both relevance and non-relevance feedback information. The proposed approach can largely reduce noise in discovered patterns and significantly improve the performance of text mining systems. This study provides a promising method for the study of Data Mining and Web Intelligence.
26

Mahdavi, Lahijani Mohammad [Verfasser], Ziawasch [Akademischer Betreuer] Abedjan, Ziawasch [Gutachter] Abedjan, Wolfgang [Gutachter] Lehner, and Eugene [Gutachter] Wu. "Semi-supervised data cleaning / Mohammad Mahdavi Lahijani ; Gutachter: Ziawasch Abedjan, Wolfgang Lehner, Eugene Wu ; Betreuer: Ziawasch Abedjan." Berlin : Technische Universität Berlin, 2020. http://d-nb.info/1223023060/34.

Full text
27

Pecorella, Tommaso. "Progettazione ed implementazione di un data warehouse di supporto alla profilazione dei consumi energetici domestici." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amslaurea.unibo.it/8355/.

Full text
28

LI, PEI. "Linking records with value diversity." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2013. http://hdl.handle.net/10281/42976.

Full text
Abstract:
Most record linkage techniques assume that information about the underlying entities does not change and is provided in different representations, sometimes with errors. For example, mailing lists may contain multiple entries representing the same physical address, but each record may be slightly different, e.g., containing different spellings or missing some information. As a second example, consider a company that has different customer databases (e.g., one for each subsidiary). A given customer may appear in different ways in each database, and there is a fair amount of guesswork in determining which customers match. In the real world, however, we often observe value diversity in data sets used for linkage. For example, many data sets contain temporal records over a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at that particular time (e.g., author information in DBLP). In such cases, we often wish to identify records that describe the same entity over time and so enable interesting longitudinal data analysis. Value diversity also exists in group linkage: linking records that refer to entities in the same group. Applications for group linkage include finding businesses in the same chain, finding conference attendants from the same affiliation, finding players from the same team, etc. In such cases, although different members of the same group can share some similar global values, they represent different entities and so can also have distinct local values, requiring a high tolerance for value diversity. However, most existing record linkage techniques assume that records describing the same real-world entities are fairly consistent and often focus on different representations of the same value, such as "IBM" and "International Business Machines". Thus, they can fall short when values vary for the same entity. This dissertation studies how to improve the linkage quality of integrated data with tolerance to fairly high diversity, including temporal linkage and group linkage. We solve the problem of temporal record linkage in two ways. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets. For group linkage, we present a two-stage algorithm: the first stage identifies cores containing records that are very likely to belong to the same group; the second stage collects strong evidence from the cores and leverages it to merge more records into the same group, while being tolerant to differences in other values. Our algorithm is designed to ensure efficiency and scalability. An experiment shows that it finished in 2.4 hours on a real-world data set containing 6.8 million records, and obtained both a precision and a recall of above .95. Finally, we build the CHRONOS system, which offers users a useful tool for finding real-world entities over time and understanding the history of entities in the bibliography domain. The core of CHRONOS is a temporal record-linkage algorithm that is tolerant to value evolution over time. Our algorithm can obtain an F-measure of over 0.9 in linking author records and fix errors made by DBLP.
We show how CHRONOS allows users to explore the history of authors, and how it helps users understand our linkage results by comparing our results with those of existing systems, highlighting differences in the results, explaining our decisions to users, and answering “what-if” questions.
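The two ideas sketched in this abstract, time decay and time-aware clustering, can be illustrated with a small Python example (a minimal sketch; the field names, the half-life parameter and the scoring weights are illustrative assumptions, not the dissertation's actual algorithm):

```python
import math

def decayed_penalty(value_a, value_b, years_apart, half_life=5.0):
    """Penalty for a value mismatch, discounted by elapsed time: the longer the gap
    between two records, the more likely an attribute such as an author's
    affiliation has legitimately changed, so disagreement counts less."""
    if value_a == value_b:
        return 0.0
    return math.exp(-math.log(2) * years_apart / half_life)  # 1 when simultaneous, ~0 far apart

def temporal_match_score(rec_a, rec_b):
    """Toy linkage score for two author records with 'name', 'affiliation', 'year'."""
    gap = abs(rec_a["year"] - rec_b["year"])
    name_agree = 1.0 if rec_a["name"] == rec_b["name"] else 0.0
    return name_agree - 0.5 * decayed_penalty(rec_a["affiliation"], rec_b["affiliation"], gap)

r1 = {"name": "Wei Wang", "affiliation": "UNC", "year": 2004}
r2 = {"name": "Wei Wang", "affiliation": "UCLA", "year": 2012}
print(temporal_match_score(r1, r2))  # affiliation mismatch is largely forgiven after 8 years
```

A global, time-ordered clustering step would then chain such pairwise decisions instead of deciding each pair in isolation.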
APA, Harvard, Vancouver, ISO, and other styles
29

Tian, Yongchao. "Accéler la préparation des données pour l'analyse du big data." Thesis, Paris, ENST, 2017. http://www.theses.fr/2017ENST0017/document.

Full text
Abstract:
We are living in a big data world, where data is being generated with high volume, high velocity and high variety. Big data brings enormous value and benefits, so that data analytics has become a critically important driver of business success across all sectors. However, if the data is not analyzed fast enough, the benefits of big data will be limited or even lost. Despite the existence of many modern large-scale data analysis systems, data preparation, which is the most time-consuming process in data analytics, has not yet received sufficient attention. In this thesis, we study the problem of how to accelerate data preparation for big data analytics. In particular, we focus on two major data preparation steps, data loading and data cleaning. As the first contribution of this thesis, we design DiNoDB, a SQL-on-Hadoop system which achieves interactive-speed query execution without requiring data loading. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on the temporary data generated by those batch jobs. Existing solutions largely ignore the synergy between these two aspects and require loading the entire temporary dataset to offer interactive queries. In contrast, DiNoDB avoids the expensive data loading and transformation phase. The key innovation of DiNoDB is to piggyback the creation of metadata on the batch processing phase; DiNoDB then exploits this metadata to expedite interactive queries. The second contribution is a distributed stream data cleaning system, called Bleach. Existing scalable data cleaning approaches rely on batch processing to improve data quality, which is very time-consuming in nature. We target stream data cleaning, in which data is cleaned incrementally in real time. Bleach is the first qualitative stream data cleaning system, achieving both real-time violation detection and data repair on a dirty data stream. It relies on efficient, compact and distributed data structures to maintain the state necessary to clean data, and also supports rule dynamics. We demonstrate in our experimental evaluations that the two resulting systems, DiNoDB and Bleach, both achieve excellent performance compared to state-of-the-art approaches and can help data scientists significantly reduce the time they spend on data preparation.
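A minimal sketch of incremental violation detection and repair on a stream, in the spirit of what the abstract describes for Bleach (the rule zip → city, the in-memory state layout and the repair policy are assumptions for illustration, not the system's actual design):

```python
# Incremental detection of functional-dependency violations (zip -> city)
# on a stream of records.  State is a plain dict here; a distributed system
# would shard it, but the incremental logic is the same.
state = {}          # zip code -> first city seen for that zip
violations = []

def clean(record):
    zip_code, city = record["zip"], record["city"]
    if zip_code not in state:
        state[zip_code] = city                       # remember the value to check against
        return record
    if state[zip_code] != city:
        violations.append(record)                    # real-time violation detection
        record = dict(record, city=state[zip_code])  # naive repair: keep the first value seen
    return record

stream = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},      # violates zip -> city
    {"zip": "94105", "city": "San Francisco"},
]
cleaned = [clean(r) for r in stream]
print(len(violations), cleaned[1]["city"])   # 1 violation, repaired to 'New York'
```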
APA, Harvard, Vancouver, ISO, and other styles
30

Sadeghianasl, Sareh. "The quality guardian: Improving activity label quality in event logs through gamification." Thesis, Queensland University of Technology, 2022. https://eprints.qut.edu.au/229543/1/Sareh_Sadeghianasl_Thesis.pdf.

Full text
Abstract:
Data cleaning, the most tedious task of data analysis, can turn into a fun experience when performed through a game. This thesis shows that the use of gamification and crowdsourcing techniques can mitigate the problem of poor quality of process data. The Quality Guardian, a family of gamified systems, is proposed, which exploits the motivational drives of domain experts to engage with the detection and repair of imperfect activity labels in process data. Evaluation of the developed games using real-life data sets and domain experts shows quality improvement as well as a positive user experience.
APA, Harvard, Vancouver, ISO, and other styles
31

Ortona, Stefano. "Easing information extraction on the web through automated rules discovery." Thesis, University of Oxford, 2016. https://ora.ox.ac.uk/objects/uuid:a5a7a070-338a-4afc-8be5-a38b486cf526.

Full text
Abstract:
The advent of the era of big data on the Web has made automatic web information extraction an essential tool in data acquisition processes. Unfortunately, automated solutions are in most cases more error prone than those created by humans, resulting in dirty and erroneous data. Automatic repair and cleaning of the extracted data is thus a necessary complement to information extraction on the Web. This thesis investigates the problem of inducing cleaning rules on web extracted data in order to (i) repair and align the data w.r.t. an original target schema, (ii) produce repairs that are as generic as possible such that different instances can benefit from them. The problem is addressed from three different angles: replace cross-site redundancy with an ensemble of entity recognisers; produce general repairs that can be encoded in the extraction process; and exploit entity-wide relations to infer common knowledge on extracted data. First, we present ROSeAnn, an unsupervised approach to integrate semantic annotators and produce a unified and consistent annotation layer on top of them. Both the diversity in vocabulary and widely varying accuracy justify the need for middleware that reconciles different annotator opinions. Considering annotators as "black-boxes" that do not require per-domain supervision allows us to recognise semantically related content in web extracted data in a scalable way. Second, we show in WADaR how annotators can be used to discover rules to repair web extracted data. We study the problem of computing joint repairs for web data extraction programs and their extracted data, providing an approximate solution that requires no per-source supervision and proves effective across a wide variety of domains and sources. The proposed solution is effective not only in repairing the extracted data, but also in encoding such repairs in the original extraction process. Third, we investigate how relationships among entities can be exploited to discover inconsistencies and additional information. We present RuDiK, a disk-based scalable solution to discover first-order logic rules over RDF knowledge bases built from web sources. We present an approach that does not limit its search space to rules that rely on "positive" relationships between entities, as is the case with traditional mining of constraints. On the contrary, it extends the search space to also discover negative rules, i.e., patterns that lead to contradictions in the data.
APA, Harvard, Vancouver, ISO, and other styles
32

Tian, Yongchao. "Accéler la préparation des données pour l'analyse du big data." Electronic Thesis or Diss., Paris, ENST, 2017. http://www.theses.fr/2017ENST0017.

Full text
Abstract:
We are living in a big data world, where data is being generated with high volume, high velocity and high variety. Big data brings enormous value and benefits, so that data analytics has become a critically important driver of business success across all sectors. However, if the data is not analyzed fast enough, the benefits of big data will be limited or even lost. Despite the existence of many modern large-scale data analysis systems, data preparation, which is the most time-consuming process in data analytics, has not yet received sufficient attention. In this thesis, we study the problem of how to accelerate data preparation for big data analytics. In particular, we focus on two major data preparation steps, data loading and data cleaning. As the first contribution of this thesis, we design DiNoDB, a SQL-on-Hadoop system which achieves interactive-speed query execution without requiring data loading. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on the temporary data generated by those batch jobs. Existing solutions largely ignore the synergy between these two aspects and require loading the entire temporary dataset to offer interactive queries. In contrast, DiNoDB avoids the expensive data loading and transformation phase. The key innovation of DiNoDB is to piggyback the creation of metadata on the batch processing phase; DiNoDB then exploits this metadata to expedite interactive queries. The second contribution is a distributed stream data cleaning system, called Bleach. Existing scalable data cleaning approaches rely on batch processing to improve data quality, which is very time-consuming in nature. We target stream data cleaning, in which data is cleaned incrementally in real time. Bleach is the first qualitative stream data cleaning system, achieving both real-time violation detection and data repair on a dirty data stream. It relies on efficient, compact and distributed data structures to maintain the state necessary to clean data, and also supports rule dynamics. We demonstrate in our experimental evaluations that the two resulting systems, DiNoDB and Bleach, both achieve excellent performance compared to state-of-the-art approaches and can help data scientists significantly reduce the time they spend on data preparation.
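The metadata-piggybacking idea behind DiNoDB can be illustrated with a small sketch: while a batch job writes its temporary output, it also records per-file statistics that a later interactive query uses to skip files (the file layout, the min/max statistics and the query are illustrative assumptions, not DiNoDB's actual format):

```python
import csv, os, tempfile

def write_partition(path, rows):
    """Write one output partition and piggyback min/max statistics on the way."""
    values = [r["amount"] for r in rows]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return {"file": path, "min": min(values), "max": max(values)}

def query_gt(metadata, threshold):
    """Interactive query: scan only the files whose max can exceed the threshold."""
    hits = []
    for meta in metadata:
        if meta["max"] <= threshold:
            continue                       # pruned thanks to the piggybacked metadata
        with open(meta["file"]) as f:
            hits += [row for row in csv.DictReader(f) if float(row["amount"]) > threshold]
    return hits

tmp = tempfile.mkdtemp()
metadata = [
    write_partition(os.path.join(tmp, "part0.csv"), [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]),
    write_partition(os.path.join(tmp, "part1.csv"), [{"id": 3, "amount": 500}]),
]
print(query_gt(metadata, 100))   # only part1.csv is actually read
```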
APA, Harvard, Vancouver, ISO, and other styles
33

Nunes, Marcos Freitas. "Avaliação experimental de uma técnica de padronização de escores de similaridade." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2009. http://hdl.handle.net/10183/25494.

Full text
Abstract:
With the growth of the Web, the volume of information has grown considerably over the past years and, consequently, access to remote databases has become easier, which allows the integration of distributed information. Usually, instances of the same real-world object, originating from distinct databases, present differences in the representation of their values, which means that the same information can be represented in different ways. In this context, research on approximate matching using similarity functions arises. As a consequence, there is a need to understand the results of these functions and to select ideal thresholds. Also, when matching records, there is the problem of combining the similarity scores, since distinct functions have different distributions. With the purpose of overcoming this problem, a previous work developed a technique that standardizes the scores by replacing the computed score with an adjusted score (computed through training), which is more intuitive for the user and can be combined in the record matching process. This technique was developed by a PhD student from the UFRGS database research group and is referred to here as MeaningScore (DORNELES et al., 2007). The present work studies and performs a detailed experimental evaluation of this technique. The evaluation shows that the MeaningScore approach is valid and returns better results: in the record matching process, where distinct similarity scores must be combined, using the standardized score instead of the original score returned by the similarity function produces results of higher quality.
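The general idea of replacing a raw similarity score with a trained, more interpretable one can be sketched as follows; this is a simple precision-per-bin calibration written for illustration and not the actual MeaningScore technique of DORNELES et al. (2007):

```python
from difflib import SequenceMatcher

def raw_score(a, b):
    return SequenceMatcher(None, a, b).ratio()

def train_adjustment(training_pairs, bins=5):
    """training_pairs: list of (value_a, value_b, is_match).  For each score bin,
    the adjusted score is the fraction of true matches observed in that bin,
    which is directly interpretable by the user and comparable across functions."""
    counts = [[0, 0] for _ in range(bins)]            # [matches, total] per bin
    for a, b, is_match in training_pairs:
        i = min(int(raw_score(a, b) * bins), bins - 1)
        counts[i][1] += 1
        counts[i][0] += int(is_match)
    return [m / t if t else 0.0 for m, t in counts]

def adjusted_score(a, b, table):
    i = min(int(raw_score(a, b) * len(table)), len(table) - 1)
    return table[i]

training = [("Av. Brasil", "Avenida Brasil", True),
            ("Av. Brasil", "Rua Ipiranga", False),
            ("R. Ipiranga", "Rua Ipiranga", True),
            ("Sao Paulo", "Sao Pablo", True),
            ("Sao Paulo", "Porto Alegre", False)]
table = train_adjustment(training)
print(adjusted_score("Avda Brasil", "Avenida Brasil", table))  # adjusted, user-interpretable score
```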
APA, Harvard, Vancouver, ISO, and other styles
34

Blackmore, Caitlin E. "The Effectiveness of Warnings at Reducing the Prevalence of Insufficient Effort Responding." Wright State University / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=wright1412080619.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Boskovitz, Agnes. "Data Editing and Logic: The covering set method from the perspective of logic." The Australian National University. Research School of Information Sciences and Engineering, 2008. http://thesis.anu.edu.au./public/adt-ANU20080314.163155.

Full text
Abstract:
Errors in collections of data can cause significant problems when those data are used. Therefore the owners of data find themselves spending much time on data cleaning. This thesis is a theoretical work about one part of the broad subject of data cleaning - to be called the covering set method. More specifically, the covering set method deals with data records that have been assessed by the use of edits, which are rules that the data records are supposed to obey. The problem solved by the covering set method is the error localisation problem, which is the problem of determining the erroneous fields within data records that fail the edits. In this thesis I analyse the covering set method from the perspective of propositional logic. I demonstrate that the covering set method has strong parallels with well-known parts of propositional logic. The first aspect of the covering set method that I analyse is the edit generation function, which is the main function used in the covering set method. I demonstrate that the edit generation function can be formalised as a logical deduction function in propositional logic. I also demonstrate that the best-known edit generation function, written here as FH (standing for Fellegi-Holt), is essentially the same as propositional resolution deduction. Since there are many automated implementations of propositional resolution, the equivalence of FH with propositional resolution gives some hope that the covering set method might be implementable with automated logic tools. However, before any implementation, the other main aspect of the covering set method must also be formalised in terms of logic. This other aspect, to be called covering set correctibility, is the property that must be obeyed by the edit generation function if the covering set method is to successfully solve the error localisation problem. In this thesis I demonstrate that covering set correctibility is a strengthening of the well-known logical properties of soundness and refutation completeness. What is more, the proofs of the covering set correctibility of FH and of the soundness / completeness of resolution deduction have strong parallels: while the proof of soundness / completeness depends on the reduction property for counter-examples, the proof of covering set correctibility depends on the related lifting property. In this thesis I also use the lifting property to prove the covering set correctibility of the function defined by the Field Code Forest Algorithm. In so doing, I prove that the Field Code Forest Algorithm, whose correctness has been questioned, is indeed correct. The results about edit generation functions and covering set correctibility apply to both categorical edits (edits about discrete data) and arithmetic edits (edits expressible as linear inequalities). Thus this thesis gives the beginnings of a theoretical logical framework for error localisation, which might give new insights to the problem. In addition, the new insights will help develop new tools using automated logic tools. What is more, the strong parallels between the covering set method and aspects of logic are of aesthetic appeal.
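The error localisation problem solved by the covering set method can be stated operationally: find a smallest set of fields which, when allowed to take any value in their domain, lets the record satisfy every edit. A brute-force sketch over a toy categorical record makes this concrete (it is not the Fellegi-Holt edit generation function analysed in the thesis):

```python
from itertools import combinations, product

# Each edit names the fields it involves and gives a predicate the record must obey.
domains = {"age_group": ["child", "adult"], "marital": ["single", "married"]}
edits = [
    (("age_group", "marital"),
     lambda r: not (r["age_group"] == "child" and r["marital"] == "married")),
]

def violated(record):
    return [pred for _, pred in edits if not pred(record)]

def localise(record):
    """Smallest set of fields whose re-imputation can satisfy all edits."""
    fields = list(record)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            free = {f: domains[f] for f in subset}                 # fields allowed to change
            fixed = {f: [record[f]] for f in fields if f not in subset}
            space = {**fixed, **free}
            for values in product(*space.values()):
                candidate = dict(zip(space.keys(), values))
                if not violated(candidate):
                    return subset                                  # a minimal set of suspect fields
    return tuple(fields)

print(localise({"age_group": "child", "marital": "married"}))  # ('age_group',): one field suffices
```

The edit generation functions analysed in the thesis avoid this exhaustive search by deriving implied edits instead.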
APA, Harvard, Vancouver, ISO, and other styles
36

Lamer, Antoine. "Contribution à la prévention des risques liés à l’anesthésie par la valorisation des informations hospitalières au sein d’un entrepôt de données." Thesis, Lille 2, 2015. http://www.theses.fr/2015LIL2S021/document.

Full text
Abstract:
Introduction: Hospital Information Systems (HIS) record and process millions of data items related to patient care every day: biological test results, vital sign measurements, drug administrations, care pathways, etc. These data are stored by operational applications that provide remote access and a comprehensive view of the patient's Electronic Health Record. These data may also be used for other purposes, such as clinical research or public health, particularly when integrated into a data warehouse. Several studies have highlighted a statistical link between compliance with quality indicators for the anesthesia procedure and patient outcome during the hospital stay. In the University Hospital of Lille, these quality indicators, as well as the patient's comorbidities during the post-operative period, could be computed from data collected by applications of the HIS. The main objective of this work is to integrate the data collected by these operational applications in order to carry out clinical research studies. Methods: First, the quality of the data registered by the operational applications is evaluated with methods presented in the literature or developed in this work. Then, the data quality problems highlighted by the evaluation are handled during the integration step of the ETL process. New data are computed and aggregated in order to provide indicators of quality of care. Finally, two case studies demonstrate the usability of the system. Results: Relevant data from the HIS have been integrated into an anesthesia data warehouse. This system stores information about hospital stays and interventions performed since 2010 (drug administrations, intervention steps, measurements, care pathways, ...) recorded by the source applications. Aggregated data have been computed and used in two clinical research studies. The first study highlighted a statistical link between hypotension related to anesthesia induction and patient outcome; predictive factors for this hypotension were also established. The second study evaluated compliance with ventilation quality indicators and the impact on respiratory comorbidities. Discussion: The data warehouse and the cleaning and integration methods developed as part of this work allow retrospective statistical analyses to be performed on more than 200,000 interventions. The system can be extended to other source applications used in the CHRU of Lille, but also to Anesthesia Information Management Systems used by other hospitals.
APA, Harvard, Vancouver, ISO, and other styles
37

Pabarškaitė, Židrina. "Enhancements of pre-processing, analysis and presentation techniques in web log mining." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2009. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2009~D_20090713_142203-05841.

Full text
Abstract:
As the Internet is becoming an important part of our life, more attention is paid to information quality and how it is displayed to the user. The research area of this work is web data analysis and methods for processing this data. This knowledge can be extracted by gathering web servers' data – log files, where all users' navigational patterns are recorded. The research object of the dissertation is the web log data mining process. General topics related to this object are: web log data preparation methods, data mining algorithms for prediction and classification tasks, and web text mining. The key target of the thesis is to develop methods that improve the knowledge discovery steps when mining web log data and thereby reveal new opportunities to the data analyst. While performing web log analysis, it was discovered that insufficient attention has been paid to the web log data cleaning process. By reducing the number of redundant records, the data mining process becomes much more effective and faster. Therefore a new, original cleaning framework was introduced which keeps only the records that correspond to real user clicks. People tend to understand technical information better if it resembles human language. It is therefore advantageous to use decision trees for mining web log data, as they generate web usage patterns in the form of rules which are understandable to humans. However, it was discovered that users' browsing history lengths differ, therefore specific data... [to full text]
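The dissertation's cleaning framework itself is not reproduced here, but the generic idea of keeping only log records that correspond to real user clicks can be sketched with common, assumed heuristics (filtering out embedded resources and robot traffic):

```python
import re

# Keep only log entries that plausibly correspond to a real user click:
# successful GET requests for pages, not embedded resources or robots.
RESOURCE_EXT = re.compile(r"\.(gif|jpe?g|png|css|js|ico)(\?|$)", re.IGNORECASE)
ROBOT_HINTS = ("bot", "crawler", "spider")

def is_user_click(entry):
    if entry["status"] != 200 or entry["method"] != "GET":
        return False
    if RESOURCE_EXT.search(entry["url"]):
        return False                       # images/scripts fetched automatically by the browser
    if any(h in entry["agent"].lower() for h in ROBOT_HINTS):
        return False
    return True

log = [
    {"method": "GET", "url": "/products", "status": 200, "agent": "Mozilla/5.0"},
    {"method": "GET", "url": "/logo.png", "status": 200, "agent": "Mozilla/5.0"},
    {"method": "GET", "url": "/products", "status": 200, "agent": "Googlebot/2.1"},
]
print([e["url"] for e in log if is_user_click(e)])   # ['/products']
```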
APA, Harvard, Vancouver, ISO, and other styles
38

Pinha, André Teixeira. "Monitoramento de doadores de sangue através de integração de bases de texto heterogêneas." reponame:Repositório Institucional da UFABC, 2016.

Find full text
Abstract:
Advisor: Prof. Dr. Márcio Katsumi Oikawa
Master's dissertation - Universidade Federal do ABC, Graduate Program in Computer Science, 2016.
Through probabilistic record linkage of databases it is possible to obtain information that individual or manual analysis of the databases would not provide. This work aims to find, through probabilistic record linkage, blood donors from the database of Fundação Pró-Sangue (FPS) in the Brazilian mortality information system, Sistema de Informações sobre Mortalidade (SIM), for the years 2001 to 2006, thus supporting the institution's management of blood products by inferring whether a given donor has died. For this purpose, we evaluated the effectiveness of different blocking keys applied both in a set of free record linkage software tools and in the software implemented specifically for this study, entitled SortedLink. In the experiments, the records were standardized and only those with the mother's name registered were used. To assess the effectiveness of the blocking keys, 100,000 records were randomly selected from the SIM and FPS databases, and 30 validation records were added to each set. Since the SortedLink software implemented in this work showed the best results, it was used to obtain the possible record pairs over the full databases: 1,709,819 records from SIM and 334,077 from FPS. In addition, the study also evaluates the efficiency of the SOUNDEX phonetic encoding algorithm, typically used in the record linkage process, and of BRSOUND, developed for encoding names and surnames from Brazilian Portuguese.
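SOUNDEX is a standard algorithm, so a compact implementation illustrates the kind of phonetic key compared in this study; BRSOUND, the Brazilian-Portuguese variant, is not reproduced here. Codes like these are also typical ingredients of blocking keys:

```python
def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    name = name.upper()
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4", **dict.fromkeys("MN", "5"), "R": "6"}
    encoded = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue                      # H and W do not separate equal codes
        code = codes.get(ch, "")          # vowels map to "" and reset the previous code
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]

# Records whose names share a Soundex code fall into the same candidate block.
print(soundex("Robert"), soundex("Rupert"))   # R163 R163 -> compared as a candidate pair
print(soundex("Ashcraft"))                    # A261
```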
APA, Harvard, Vancouver, ISO, and other styles
39

Andrade, Tiago Luís de [UNESP]. "Ambiente independente de idioma para suporte a identificação de tuplas duplicadas por meio da similaridade fonética e numérica: otimização de algoritmo baseado em multithreading." Universidade Estadual Paulista (UNESP), 2011. http://hdl.handle.net/11449/98678.

Full text
Abstract:
In order to ensure greater reliability and consistency of the data stored in a database, the data cleaning stage is placed at the beginning of the process of Knowledge Discovery in Databases (KDD). This step has significant importance because it eliminates problems that strongly affect the reliability of the extracted knowledge, such as missing values, null values, duplicate tuples and out-of-domain values. It is an important step aimed at correcting and adjusting the data for the subsequent stages. Within this perspective, techniques that seek to address the various problems mentioned are presented. This work characterizes the detection of duplicate tuples in databases, presents the main algorithms based on distance metrics and some tools designed for this activity, and develops a language-independent algorithm for identifying duplicate records based on phonetic and numeric similarity, implemented with multithreading to improve the runtime performance of the algorithm. Tests show that the proposed algorithm achieved better results in identifying duplicate records than existing phonetic algorithms, which ensures better cleaning of the database.
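The combination of phonetic-style and numeric similarity with multithreaded comparison can be sketched as follows (field names, weights, the 0.8 threshold and the use of a thread pool are illustrative assumptions; in CPython, CPU-bound speed-ups would in practice require processes rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

# Candidate duplicate detection: string/name similarity plus numeric closeness,
# with the pairwise comparisons dispatched to a pool of worker threads.
records = [
    {"id": 1, "name": "Joao da Silva", "amount": 150.0},
    {"id": 2, "name": "João da Sylva", "amount": 150.5},
    {"id": 3, "name": "Maria Souza", "amount": 99.0},
]

def similarity(pair):
    a, b = pair
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    num_sim = 1.0 - min(abs(a["amount"] - b["amount"]) / max(a["amount"], b["amount"]), 1.0)
    return (a["id"], b["id"], 0.7 * name_sim + 0.3 * num_sim)

pairs = [(records[i], records[j]) for i in range(len(records)) for j in range(i + 1, len(records))]
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(similarity, pairs))

print([(i, j) for i, j, s in scores if s > 0.8])   # [(1, 2)] flagged as likely duplicates
```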
APA, Harvard, Vancouver, ISO, and other styles
40

Vavruška, Marek. "Realised stochastic volatility in practice." Master's thesis, Vysoká škola ekonomická v Praze, 2012. http://www.nusl.cz/ntk/nusl-165381.

Full text
Abstract:
The Realised Stochastic Volatility model of Koopman and Scharth (2011) is applied to five stocks listed on the NYSE in this thesis. The aim of the thesis is to investigate the effect of speeding up trade data processing by skipping the cleaning rule that requires the quote data. The framework of the Realised Stochastic Volatility model allows the realised measures to be biased estimates of the integrated volatility, which further supports this approach. The number of errors in recorded trades has decreased significantly over the past years. Different sample lengths were used to construct one-day-ahead forecasts of the realised measures in order to examine the sensitivity of forecast precision to the rolling window length. Using the longest window length does not lead to the lowest mean square error. The dominance of the Realised Stochastic Volatility model in terms of the lowest mean square errors of one-day-ahead out-of-sample forecasts has been confirmed.
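A realised measure such as realised variance is simply the sum of squared intraday log returns, which a few lines make explicit (the prices below are made up, not NYSE data):

```python
import math

def realised_variance(prices):
    """Sum of squared intraday log returns over one trading day."""
    log_returns = [math.log(p2 / p1) for p1, p2 in zip(prices, prices[1:])]
    return sum(r * r for r in log_returns)

# 5-minute prices for one day (illustrative values only)
prices = [100.0, 100.3, 99.8, 100.1, 100.6, 100.4]
rv = realised_variance(prices)
print(rv, math.sqrt(rv))   # daily realised variance and realised volatility
```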
APA, Harvard, Vancouver, ISO, and other styles
41

Zaidi, Houda. "Amélioration de la qualité des données : correction sémantique des anomalies inter-colonnes." Thesis, Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1094/document.

Full text
Abstract:
Data quality represents a major challenge because the cost of anomalies can be very high, especially for large databases in enterprises that need to exchange information between systems and integrate large amounts of data. Decision making based on erroneous data has a negative influence on the activities of organizations. The quantity of data continues to increase, and so do the risks of anomalies. The automatic correction of these anomalies is a topic that is becoming more important both in business and in the academic world. In this work, we propose an approach to better understand the semantics and the structure of the data. Our approach helps to automatically correct intra-column anomalies as well as inter-column ones. We aim to improve the quality of data by processing null values and the semantic dependencies between columns, while paying attention to the performance of the resulting processing.
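A typical inter-column anomaly relative to a functional dependency can be detected and repaired as in the following sketch; the majority-vote repair policy is an assumption for illustration, not necessarily the thesis's semantic correction method:

```python
from collections import Counter, defaultdict

rows = [
    {"dept_id": "D1", "dept_name": "Sales"},
    {"dept_id": "D1", "dept_name": "Sales"},
    {"dept_id": "D1", "dept_name": "Slaes"},     # violates dept_id -> dept_name
    {"dept_id": "D2", "dept_name": "Finance"},
]

def repair_fd(rows, lhs, rhs):
    """Enforce the functional dependency lhs -> rhs by keeping, for each lhs
    value, the most frequent rhs value (majority-vote repair)."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    repaired, fixes = [], 0
    for row in rows:
        best = groups[row[lhs]].most_common(1)[0][0]
        if row[rhs] != best:
            row = dict(row, **{rhs: best})
            fixes += 1
        repaired.append(row)
    return repaired, fixes

clean_rows, fixes = repair_fd(rows, "dept_id", "dept_name")
print(fixes, clean_rows[2]["dept_name"])   # 1 'Sales'
```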
APA, Harvard, Vancouver, ISO, and other styles
42

Cugler, Daniel Cintra 1982. "Supporting the collection and curation of biological observation metadata = Apoio à coleta e curadoria de metadados de observações biológicas." [s.n.], 2014. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275520.

Full text
Abstract:
Advisor: Claudia Maria Bauzer Medeiros
Doctoral thesis - Universidade Estadual de Campinas, Instituto de Computação
Biological observation databases contain information about the occurrence of an organism or set of organisms detected at a given place and time according to some methodology. Such databases store a variety of data, at multiple spatial and temporal scales, including images, maps, sounds, texts and so on. This priceless information can be used in a wide range of research initiatives, e.g., global warming, species behavior or food production. All such studies are based on analyzing the records themselves, and their metadata. Most times, analyses start from metadata, often used to index the observation records. However, given the nature of observation activities, metadata may suffer from quality problems, hampering such analyses. For example, there may be metadata gaps (e.g., missing attributes, or insufficient records). This can have serious effects: in biodiversity studies, for instance, metadata problems regarding a single species can affect the understanding not just of the species, but of wider ecological interactions. This thesis proposes a set of processes to help solve problems in metadata quality. While previous approaches concern one given aspect of the problem, the thesis provides an architecture and algorithms that encompass the whole cycle of managing biological observation metadata, which goes from acquiring data to retrieving database records. Our contributions are divided into two categories: (a) data enrichment and (b) data cleaning. Contributions in category (a) provide additional information for both missing attributes in existing records and missing records for specific requirements. Our strategies use authoritative remote data sources and VGI (Volunteered Geographic Information) to enrich such metadata, providing the missing information. Contributions in category (b) detect anomalies in biological observation metadata by performing spatial analyses that contrast the location of the observations with authoritative geographic distribution maps. Thus, the main contributions are: (i) an architecture to retrieve biological observation records, which derives missing attributes by using external data sources; (ii) a geographical approach for anomaly detection and (iii) an approach for adaptive acquisition of VGI to fill out metadata gaps, using mobile devices and sensors. These contributions were validated by actual implementations, using as a case study the challenges presented by the management of biological observation metadata of the Fonoteca Neotropical Jacques Vielliard (FNJV), one of the 10 largest animal sound collections in the world.
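The spatial contrast between an observation's location and an authoritative distribution map boils down to a point-in-polygon test; a minimal sketch with an assumed rectangular species range illustrates the anomaly flagging:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Assumed rectangular distribution range and two observation records (illustrative only).
species_range = [(-55.0, -25.0), (-40.0, -25.0), (-40.0, -5.0), (-55.0, -5.0)]
observations = [
    {"id": "obs1", "lon": -47.0, "lat": -22.8},   # inside the range
    {"id": "obs2", "lon": -10.0, "lat": 40.0},    # outside: flagged as a location anomaly
]
anomalies = [o["id"] for o in observations
             if not point_in_polygon(o["lon"], o["lat"], species_range)]
print(anomalies)   # ['obs2']
```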
Doctorate
Computer Science
Doctor of Computer Science
APA, Harvard, Vancouver, ISO, and other styles
43

Olejník, Tomáš. "Zpracování obchodních dat finančního trhu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-412828.

Full text
Abstract:
The objective of this master's thesis is to study the basics of high-frequency trading, especially trading on the foreign exchange market. The project deals with foreign exchange data preprocessing: the fundamentals of market data collection, data storage and data cleaning are discussed. Making decisions based on poor-quality data can have fatal consequences in the money business, therefore data cleaning is necessary. The thesis describes an adaptive data cleaning algorithm which is able to adapt to current market conditions. Following the design, a modular plug-in application for data collection, storage and subsequent cleaning has been implemented.
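The thesis's adaptive cleaning algorithm is not reproduced here; the following sketch only illustrates the general family of adaptive filters for tick data, where the tolerated deviation follows local market conditions through a rolling median and MAD (window size, k and gamma are assumed tuning parameters):

```python
import statistics

def adaptive_clean(prices, window=5, k=3.0, gamma=0.02):
    """Drop ticks that deviate from the median of their neighbours by more than
    k * MAD + gamma.  The tolerance adapts to market conditions: in volatile
    periods the MAD grows, so larger jumps are accepted as genuine."""
    kept = []
    for i, p in enumerate(prices):
        lo, hi = max(0, i - window), min(len(prices), i + window + 1)
        neighbours = prices[lo:i] + prices[i + 1:hi]
        med = statistics.median(neighbours)
        mad = statistics.median(abs(x - med) for x in neighbours)
        if abs(p - med) <= k * mad + gamma:
            kept.append(p)
    return kept

ticks = [1.3012, 1.3013, 1.3011, 1.9000, 1.3014, 1.3012, 1.3015]  # 1.9000 is a bad tick
print(adaptive_clean(ticks))   # the 1.9000 outlier is dropped
```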
APA, Harvard, Vancouver, ISO, and other styles
44

Norng, Sorn. "Statistical decisions in optimising grain yield." Thesis, Queensland University of Technology, 2004. https://eprints.qut.edu.au/15806/1/Sorn_Norng_Thesis.pdf.

Full text
Abstract:
This thesis concerns Precision Agriculture (PA) technology, which involves methods developed to optimise grain yield by examining data quality and modelling the protein/yield relationship of wheat and sorghum fields in central and southern Queensland. An important part of developing strategies to optimise grain yield is the understanding of PA technology. This covers major aspects of PA, which include all the components of a Site-Specific Crop Management System (SSCM). These components are 1. Spatial referencing, 2. Crop, soil and climate monitoring, 3. Attribute mapping, 4. Decision support systems and 5. Differential action. Understanding how all five components fit into PA significantly aids the development of data analysis methods. The development of PA is dependent on the collection, analysis and interpretation of information. A preliminary data analysis step is described which covers both non-spatial and spatial data analysis methods. The non-spatial analysis involves plotting methods (maps, histograms), standard distributions and statistical summaries (mean, standard deviation). The spatial analysis covers both undirected and directional variogram analyses. In addition to the data analysis, a theoretical investigation into GPS error is given. GPS plays a major role in the development of PA. A number of sources of error affect the GPS and therefore affect the positioning measurements. Therefore, an understanding of the distribution of the errors and how they are related to each other over time is needed to complement the understanding of the nature of the data. Understanding the error distribution and the data gives useful insights for model assumptions with regard to position measurement errors. A review of filtering methods is given and new methods are developed, namely strip analysis and a double harvesting algorithm. These methods are designed specifically for controlled traffic and normal traffic respectively, but can be applied to all kinds of yield monitoring data. The data resulting from the strip analysis and the double harvesting algorithm are used to investigate the relationship between on-the-go yield and protein. The strategy is to use protein and yield in determining decisions with respect to nitrogen management. The agronomic assumption is that protein and yield have a significant relationship based on plot trials. We investigate whether there is any significant relationship between protein and yield at the local level to warrant this kind of assumption. Understanding PA technology and being aware of the sources of error that exist in data collection and data analysis are both very important steps in developing management decision strategies.
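The undirected empirical variogram used in the spatial analysis has a simple definition, gamma(h) = (1 / 2N(h)) * sum of squared value differences over point pairs roughly h apart; a sketch with made-up yield points (coordinates and lag width are assumptions) shows the computation:

```python
import math
from collections import defaultdict

def empirical_semivariogram(points, lag_width=50.0):
    """points: list of (x, y, value) yield observations, binned by pair distance."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            x1, y1, z1 = points[i]
            x2, y2, z2 = points[j]
            lag = int(math.hypot(x2 - x1, y2 - y1) // lag_width)
            sums[lag] += (z1 - z2) ** 2
            counts[lag] += 1
    return {lag * lag_width: sums[lag] / (2 * counts[lag]) for lag in sorted(counts)}

# Toy yield observations: (easting, northing, yield in t/ha) -- illustrative values
obs = [(0, 0, 3.1), (30, 0, 3.0), (60, 0, 3.4), (120, 0, 2.2), (0, 40, 3.2), (90, 40, 2.6)]
print(empirical_semivariogram(obs))
```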
APA, Harvard, Vancouver, ISO, and other styles
45

Norng, Sorn. "Statistical decisions in optimising grain yield." Queensland University of Technology, 2004. http://eprints.qut.edu.au/15806/.

Full text
Abstract:
This thesis concerns Precision Agriculture (PA) technology, which involves methods developed to optimise grain yield by examining data quality and modelling the protein/yield relationship of wheat and sorghum fields in central and southern Queensland. An important part of developing strategies to optimise grain yield is the understanding of PA technology. This covers major aspects of PA, which include all the components of a Site-Specific Crop Management System (SSCM). These components are 1. Spatial referencing, 2. Crop, soil and climate monitoring, 3. Attribute mapping, 4. Decision support systems and 5. Differential action. Understanding how all five components fit into PA significantly aids the development of data analysis methods. The development of PA is dependent on the collection, analysis and interpretation of information. A preliminary data analysis step is described which covers both non-spatial and spatial data analysis methods. The non-spatial analysis involves plotting methods (maps, histograms), standard distributions and statistical summaries (mean, standard deviation). The spatial analysis covers both undirected and directional variogram analyses. In addition to the data analysis, a theoretical investigation into GPS error is given. GPS plays a major role in the development of PA. A number of sources of error affect the GPS and therefore affect the positioning measurements. Therefore, an understanding of the distribution of the errors and how they are related to each other over time is needed to complement the understanding of the nature of the data. Understanding the error distribution and the data gives useful insights for model assumptions with regard to position measurement errors. A review of filtering methods is given and new methods are developed, namely strip analysis and a double harvesting algorithm. These methods are designed specifically for controlled traffic and normal traffic respectively, but can be applied to all kinds of yield monitoring data. The data resulting from the strip analysis and the double harvesting algorithm are used to investigate the relationship between on-the-go yield and protein. The strategy is to use protein and yield in determining decisions with respect to nitrogen management. The agronomic assumption is that protein and yield have a significant relationship based on plot trials. We investigate whether there is any significant relationship between protein and yield at the local level to warrant this kind of assumption. Understanding PA technology and being aware of the sources of error that exist in data collection and data analysis are both very important steps in developing management decision strategies.
APA, Harvard, Vancouver, ISO, and other styles
46

Andrade, Tiago Luís de. "Ambiente independente de idioma para suporte a identificação de tuplas duplicadas por meio da similaridade fonética e numérica: otimização de algoritmo baseado em multithreading /." São José do Rio Preto : [s.n.], 2011. http://hdl.handle.net/11449/98678.

Full text
Abstract:
In order to ensure greater reliability and consistency of the data stored in a database, the data cleaning stage is placed at the beginning of the process of Knowledge Discovery in Databases (KDD). This step has significant importance because it eliminates problems that strongly affect the reliability of the extracted knowledge, such as missing values, null values, duplicate tuples and out-of-domain values. It is an important step aimed at correcting and adjusting the data for the subsequent stages. Within this perspective, techniques that seek to address the various problems mentioned are presented. This work characterizes the detection of duplicate tuples in databases, presents the main algorithms based on distance metrics and some tools designed for this activity, and develops a language-independent algorithm for identifying duplicate records based on phonetic and numeric similarity, implemented with multithreading to improve the runtime performance of the algorithm. Tests show that the proposed algorithm achieved better results in identifying duplicate records than existing phonetic algorithms, which ensures better cleaning of the database.
Advisor: Carlos Roberto Valêncio
Co-advisor: Maurizio Babini
Committee member: Pedro Luiz Pizzigatti Corrêa
Committee member: José Márcio Machado
Master's
APA, Harvard, Vancouver, ISO, and other styles
47

Zaidi, Houda. "Amélioration de la qualité des données : correction sémantique des anomalies inter-colonnes." Electronic Thesis or Diss., Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1094.

Full text
Abstract:
Data quality is a major challenge for an organisation and strongly influences the quality of its services and its profitability; the cost of anomalies can be very high, especially for large databases in enterprises that need to exchange information between systems and integrate large amounts of data. Decision making based on erroneous data has a negative impact on the activities of organisations, and as the quantity of data keeps increasing, so do the risks of anomalies. The automatic correction of these anomalies is a topic of growing importance both in industry and in the academic world. This thesis addresses the problem of improving data quality in large volumes of data. Our approach consists of helping the user to better understand the schemas of the data being manipulated and to define the actions to be carried out on them. We address several concepts, such as anomalies within a single column and inter-column anomalies related to functional dependencies. In this context we propose several means of correcting these defects, paying particular attention to the performance of the processing involved; we aim to improve data quality by handling null values and the semantic dependencies between columns.
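As a rough illustration of inter-column anomalies of the functional-dependency kind mentioned above, the snippet below flags rows that contradict an assumed dependency zip -> city and reports null values per column. The column names, sample data and majority-vote rule are assumptions made for the example, not the correction method proposed in the thesis.

# Minimal illustration (not the thesis method): report nulls per column and flag
# rows violating an assumed functional dependency zip -> city via a majority vote.
import pandas as pd

df = pd.DataFrame({
    "zip":  ["75001", "75001", "69001", "69001", "69001", None],
    "city": ["Paris", "Paris", "Lyon",  "Lyon",  "Lille", "Paris"],
})

# Null-value report per column.
print(df.isna().sum())

# For each zip, take the dominant city; rows deviating from it violate zip -> city.
known = df.dropna(subset=["zip"])
dominant = known.groupby("zip")["city"].agg(lambda s: s.mode().iat[0])
violations = known[known["city"] != known["zip"].map(dominant)]
print(violations)   # expected: the (69001, Lille) row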
APA, Harvard, Vancouver, ISO, and other styles
48

Abraham, Lukáš. "Analýza dat síťové komunikace mobilních zařízení." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2020. http://www.nusl.cz/ntk/nusl-432938.

Full text
Abstract:
The thesis first describes the DNS and SSL/TLS protocols, focusing mainly on communication between devices that use them. It then discusses data preprocessing and data cleaning. Furthermore, the thesis deals with basic data mining techniques such as classification, association rules, information retrieval, regression analysis and cluster analysis. The next chapter explains how mobile devices can be identified on the network. Data sets containing data collected from communication over the above-mentioned protocols are evaluated and used in the practical part. The thesis then presents the design of a system for analysing network communication data, describes the libraries used and the implementation of the entire system, and finally reports and evaluates a large number of experiments.
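As a sketch of the kind of classification step mentioned in this abstract, the toy example below trains a random-forest classifier to separate mobile from non-mobile traffic using two invented numeric features. The feature names, distributions and labels are synthetic assumptions, not data, features or results from the thesis.

# Toy sketch (not the thesis system): classify flows as mobile vs. non-mobile
# from two assumed numeric features. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 400
# Synthetic feature matrix: [dns_queries_per_min, mean_tls_record_bytes]
mobile     = np.column_stack([rng.normal(30, 5, n // 2), rng.normal(600, 80, n // 2)])
non_mobile = np.column_stack([rng.normal(12, 4, n // 2), rng.normal(1100, 150, n // 2)])
X = np.vstack([mobile, non_mobile])
y = np.array([1] * (n // 2) + [0] * (n // 2))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))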
APA, Harvard, Vancouver, ISO, and other styles
49

Concina, Alessandro <1990>. "Data Cleansing, different approaches for different problems." Master's Degree Thesis, Università Ca' Foscari Venezia, 2016. http://hdl.handle.net/10579/8029.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Grillo, Aderibigbe. "Developing a data quality scorecard that measures data quality in a data warehouse." Thesis, Brunel University, 2018. http://bura.brunel.ac.uk/handle/2438/17137.

Full text
Abstract:
The main purpose of this thesis is to develop a data quality scorecard (DQS) that aligns the data quality needs of the data warehouse (DW) stakeholder group with selected data quality dimensions. To comprehend the research domain, a general and systematic literature review (SLR) was carried out, after which the research scope was established. Using Design Science Research (DSR) as the methodology to structure the research, three iterations were carried out to achieve the research aim highlighted in this thesis. In the first iteration, with DSR used as the paradigm, the artefact was built from the results of the general and systematic literature review conducted, and a data quality scorecard (DQS) was conceptualised. The results of the SLR and the recommendations for designing an effective scorecard provided the input for the development of the DQS. Using a System Usability Scale (SUS) to validate the usability of the DQS, the results of the first iteration suggest that the DW stakeholders found the DQS useful. The second iteration further evaluated the DQS through a run-through in the FMCG domain followed by semi-structured interviews. The thematic analysis of the semi-structured interviews demonstrated that the stakeholder participants found the DQS to be transparent, a useful additional reporting tool, well integrated, easy to use and consistent, and that it increased their confidence in the data. However, the timeliness data dimension was found to be redundant, necessitating a modification to the DQS. The third iteration was conducted with similar steps to the second iteration but with the modified DQS in the oil and gas domain. The results from the third iteration suggest that the DQS is a useful tool that is easy to use on a daily basis. The research contributes to theory by demonstrating a novel approach to DQS design; this was achieved by ensuring that the design of the DQS aligns with the data quality concern areas of the DW stakeholders and the data quality dimensions. Further, this research lays a good foundation for the future by establishing a DQS model that can be used as a base for further development.
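The abstract describes aligning stakeholder needs with selected quality dimensions; the short sketch below shows, purely as an assumed illustration, how per-dimension scores might be weighted by stakeholder priorities and rolled up into an overall figure with traffic-light thresholds. The dimensions, weights and thresholds are invented for the example and are not those selected in the thesis.

# Hedged sketch of the general scorecard idea: weighted per-dimension scores
# with traffic-light statuses. Dimensions, weights and thresholds are assumptions.
DIMENSIONS = {
    # dimension: (measured score in [0, 1], stakeholder weight)
    "completeness": (0.96, 0.35),
    "accuracy":     (0.91, 0.30),
    "consistency":  (0.88, 0.20),
    "uniqueness":   (0.99, 0.15),
}

def scorecard(dimensions, green=0.95, amber=0.85):
    total_weight = sum(w for _, w in dimensions.values())
    overall = sum(score * w for score, w in dimensions.values()) / total_weight
    for name, (score, _) in dimensions.items():
        status = "green" if score >= green else "amber" if score >= amber else "red"
        print(f"{name:<12} {score:.2f}  {status}")
    print(f"{'overall':<12} {overall:.2f}")

scorecard(DIMENSIONS)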
APA, Harvard, Vancouver, ISO, and other styles
