Academic literature on the topic 'Data cleaning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Data cleaning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Data cleaning"

1

Pahwa, Payal, and Rashmi Chhabra. "BST Algorithm for Duplicate Elimination in Data Warehouse." International Journal of Management & Information Technology 4, no. 1 (June 26, 2013): 190–97. http://dx.doi.org/10.24297/ijmit.v4i1.4636.

Full text
Abstract:
Data warehousing is an emerging technology and has proved to be very important for organizations. Today every business organization needs accurate and large amounts of information to make proper decisions, and for such decisions the data should be of good quality. To improve data quality, data cleansing is needed; it is fundamental to warehouse data reliability and to data warehousing success. There are various methods for data cleansing. This paper addresses issues related to data cleaning, focusing on the detection of duplicate records. An efficient algorithm for data cleaning is also proposed, along with a review of data cleansing methods and a comparison between them.
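The paper's BST algorithm is not reproduced in the abstract, but the general idea it names, detecting duplicates by inserting normalized record keys into an ordered structure and rejecting keys already present, can be sketched roughly as follows (the normalization rules here are illustrative assumptions, not the authors' method):

```python
from bisect import bisect_left, insort

def normalize(record):
    """Build a comparison key: lowercase, keep only alphanumerics."""
    return tuple("".join(ch for ch in field.lower() if ch.isalnum())
                 for field in record)

def deduplicate(records):
    """Insert each record's key into a sorted list (a stand-in for a BST);
    a key that is already present marks the record as a duplicate."""
    seen, unique = [], []
    for rec in records:
        key = normalize(rec)
        i = bisect_left(seen, key)
        if i < len(seen) and seen[i] == key:
            continue            # duplicate: key already in the "tree"
        insort(seen, key)
        unique.append(rec)
    return unique

rows = [("John Smith", "NY"), ("john  smith", "ny"), ("Ann Lee", "LA")]
print(deduplicate(rows))   # the second row is dropped as a duplicate
```

A real BST gives the same O(log n) lookup per insertion; the sorted list merely keeps the sketch short.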
APA, Harvard, Vancouver, ISO, and other styles
2

Chu, Xu, and Ihab F. Ilyas. "Qualitative data cleaning." Proceedings of the VLDB Endowment 9, no. 13 (September 2016): 1605–8. http://dx.doi.org/10.14778/3007263.3007320.

3

Broman, Karl W. "Cleaning genotype data." Genetic Epidemiology 17, S1 (1999): S79–S83. http://dx.doi.org/10.1002/gepi.1370170714.

4

Jafarov, Elvin. "Data Cleaning Before Uploading to Storage." ETM - Equipment, Technologies, Materials 13, no. 01 (February 7, 2023): 117–27. http://dx.doi.org/10.36962/etm13012023-117.

Abstract:
The article considers the issue of cleaning big data before uploading it to storage, clarifying the errors commonly made and the methods for eliminating them. The technology for creating a big data storage and analysis system is reviewed, and solutions for implementing the first stages of the data science process (data acquisition, cleaning, and loading) are described. The results of the research allow us to move towards the realization of future steps in the field of big data processing. Data cleansing is an essential step in working with big data, as any analysis based on inaccurate data can lead to erroneous results. Cleaning and consolidation of data can also be performed when the data is loaded into a distributed file system. The methods of uploading data to the storage system have been tested, using an assembly from Hortonworks as the implementation. The easiest way to upload is to use the web interface of the Ambari system or to use HDFS commands to upload to Hadoop HDFS from the local system. It has been shown that the ETL process should be considered more broadly than just importing data, applying minimal transformations, and loading into the warehouse. Data cleaning should become a mandatory stage of the work, because the cost of storage is determined not only by the amount of data but also by the quality of the information collected.
Keywords: Big Data, Data Cleaning, Storage System, ETL process, Loading methods.
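As a hedged illustration of the kind of pre-load cleaning stage the article argues for (the field names and rejection rules are invented for the example; the article itself works with Hortonworks, Ambari, and HDFS, which are abstracted away here):

```python
import csv
import io

def clean_rows(reader, required=("id", "value")):
    """Minimal pre-load cleaning: trim whitespace everywhere, drop rows
    with missing required fields or a non-numeric 'value'."""
    for row in reader:
        row = {k: v.strip() for k, v in row.items()}
        if any(not row.get(k) for k in required):
            continue                      # incomplete row: reject
        try:
            row["value"] = float(row["value"])
        except ValueError:
            continue                      # type error: reject
        yield row

raw = "id,value\n1, 3.5\n2,\n3,oops\n4,7\n"
cleaned = list(clean_rows(csv.DictReader(io.StringIO(raw))))
print(cleaned)  # rows 2 and 3 are rejected before loading
```

In a full ETL pipeline this generator would sit between extraction and the HDFS load step, so only validated rows reach storage.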
5

Singh, Mohini. "Cleaning Up Company Data." CFA Institute Magazine 27, no. 1 (March 2016): 53. http://dx.doi.org/10.2469/cfm.v27.n1.18.

6

Holstad, Mark S. "Data Driven Interceptor Cleaning." Proceedings of the Water Environment Federation 2010, no. 8 (January 1, 2010): 7636–64. http://dx.doi.org/10.2175/193864710798207792.

7

Zhang, Aoqian, Shaoxu Song, Jianmin Wang, and Philip S. Yu. "Time series data cleaning." Proceedings of the VLDB Endowment 10, no. 10 (June 2017): 1046–57. http://dx.doi.org/10.14778/3115404.3115410.

8

Geerts, Floris, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. "Cleaning data with Llunatic." VLDB Journal 29, no. 4 (November 8, 2019): 867–92. http://dx.doi.org/10.1007/s00778-019-00586-5.

9

Karr, Alan F. "Exploratory Data Mining and Data Cleaning." Journal of the American Statistical Association 101, no. 473 (March 2006): 399. http://dx.doi.org/10.1198/jasa.2006.s81.

10

Borrohou, Sanae, Rachida Fissoune, and Hassan Badir. "Data cleaning survey and challenges – improving outlier detection algorithm in machine learning." Journal of Smart Cities and Society 2, no. 3 (October 9, 2023): 125–40. http://dx.doi.org/10.3233/scs-230008.

Abstract:
Data cleaning, also referred to as data cleansing, constitutes a pivotal phase in data processing subsequent to data collection. Its primary objective is to identify and eliminate incomplete data, duplicates, outdated information, anomalies, missing values, and errors. The influence of data quality on the effectiveness of machine learning (ML) models is widely acknowledged, prompting data scientists to dedicate substantial effort to data cleaning prior to model training. This study accentuates critical facets of data cleaning and the utilization of outlier detection algorithms. Additionally, our investigation encompasses the evaluation of prominent outlier detection algorithms through benchmarking, seeking to identify an efficient algorithm boasting consistent performance. As the culmination of our research, we introduce an innovative algorithm centered on the fusion of Isolation Forest and clustering techniques. By leveraging the strengths of both methods, this proposed algorithm aims to enhance outlier detection outcomes. This work endeavors to elucidate the multifaceted importance of data cleaning, underscored by its symbiotic relationship with ML models. Furthermore, our exploration of outlier detection methodologies aligns with the broader objective of refining data processing and analysis paradigms. Through the convergence of theoretical insights, algorithmic exploration, and innovative proposals, this study contributes to the advancement of data cleaning and outlier detection techniques in the realm of contemporary data-driven environments.
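The proposed fusion of Isolation Forest and clustering is not spelled out in the abstract. The self-contained sketch below substitutes a z-score for the isolation score and a single-centroid distance for the clustering component, flagging a point only when both signals agree; this is a simplification for illustration, not the authors' algorithm:

```python
from statistics import mean, stdev

def combined_outliers(xs, z_cut=2.0):
    """Flag indices whose values both have a high z-score (stand-in for
    an isolation score) and lie far from the single cluster centroid."""
    mu, sd = mean(xs), stdev(xs)
    z = [abs(x - mu) / sd for x in xs]
    # Degenerate "clustering": distance to the one centroid (the mean),
    # compared against twice the average distance.
    dists = [abs(x - mu) for x in xs]
    d_scale = mean(dists)
    return [i for i in range(len(xs))
            if z[i] > z_cut and dists[i] > 2 * d_scale]

data = [10, 11, 9, 10, 12, 11, 10, 45]   # 45 is the planted outlier
print(combined_outliers(data))            # [7]
```

Requiring agreement between two detectors is the design point the paper explores: each detector alone produces false positives that the other tends to veto.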

Dissertations / Theses on the topic "Data cleaning"

1

Liebchen, Gernot Armin. "Data cleaning techniques for software engineering data sets." Thesis, Brunel University, 2010. http://bura.brunel.ac.uk/handle/2438/5951.

Abstract:
Data quality is an important issue which has been addressed and recognised in research communities such as data warehousing, data mining and information systems. It has been agreed that poor data quality will impact the quality of results of analyses and that it will therefore impact on decisions made on the basis of these results. Empirical software engineering has neglected the issue of data quality to some extent. This fact poses the question of how researchers in empirical software engineering can trust their results without addressing the quality of the analysed data. One widely accepted definition for data quality describes it as `fitness for purpose', and the issue of poor data quality can be addressed by either introducing preventative measures or by applying means to cope with data quality issues. The research presented in this thesis addresses the latter with the special focus on noise handling. Three noise handling techniques, which utilise decision trees, are proposed for application to software engineering data sets. Each technique represents a noise handling approach: robust filtering, where training and test sets are the same; predictive filtering, where training and test sets are different; and filtering and polish, where noisy instances are corrected. The techniques were first evaluated in two different investigations by applying them to a large real world software engineering data set. In the first investigation the techniques' ability to improve predictive accuracy in differing noise levels was tested. All three techniques improved predictive accuracy in comparison to the do-nothing approach. The filtering and polish was the most successful technique in improving predictive accuracy. The second investigation utilising the large real world software engineering data set tested the techniques' ability to identify instances with implausible values. These instances were flagged for the purpose of evaluation before applying the three techniques. 
Robust filtering and predictive filtering decreased the number of instances with implausible values, but substantially decreased the size of the data set too. The filtering and polish technique actually increased the number of implausible values, but it did not reduce the size of the data set. Since the data set contained historical software project data, it was not possible to know the real extent of noise detected. This led to the production of simulated software engineering data sets, which were modelled on the real data set used in the previous evaluations to ensure domain specific characteristics. These simulated versions of the data set were then injected with noise, such that the real extent of the noise was known. After the noise injection the three noise handling techniques were applied to allow evaluation. This procedure of simulating software engineering data sets combined the incorporation of domain specific characteristics of the real world with the control over the simulated data. This is seen as a special strength of this evaluation approach. The results of the evaluation of the simulation showed that none of the techniques performed well. Robust filtering and filtering and polish performed very poorly, and based on the results of this evaluation they would not be recommended for the task of noise reduction. The predictive filtering technique was the best performing technique in this evaluation, but it did not perform significantly well either. An exhaustive systematic literature review has been carried out investigating to what extent the empirical software engineering community has considered data quality. The findings showed that the issue of data quality has been largely neglected by the empirical software engineering community. The work in this thesis highlights an important gap in empirical software engineering. It provided clarification and distinctions of the terms noise and outliers. 
Noise and outliers are overlapping, but they are fundamentally different. Since noise and outliers are often treated the same in noise handling techniques, a clarification of the two terms was necessary. To investigate the capabilities of noise handling techniques a single investigation was deemed as insufficient. The reasons for this are that the distinction between noise and outliers is not trivial, and that the investigated noise cleaning techniques are derived from traditional noise handling techniques where noise and outliers are combined. Therefore three investigations were undertaken to assess the effectiveness of the three presented noise handling techniques. Each investigation should be seen as a part of a multi-pronged approach. This thesis also highlights possible shortcomings of current automated noise handling techniques. The poor performance of the three techniques led to the conclusion that noise handling should be integrated into a data cleaning process where the input of domain knowledge and the replicability of the data cleaning process are ensured.
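Of the three techniques, predictive filtering (training and test sets differ) is the easiest to sketch. The thesis uses decision trees; the toy version below substitutes a leave-one-out 1-NN classifier to keep the example self-contained, flagging instances whose held-out prediction disagrees with their recorded label:

```python
def predictive_filter(points, labels):
    """Leave-one-out predictive filtering: predict each instance with a
    1-NN classifier trained on the remaining data, and flag disagreements
    as suspected noise. (The thesis uses decision trees; 1-NN is a
    stand-in so the sketch needs no libraries.)"""
    noisy = []
    for i, (x, y) in enumerate(zip(points, labels)):
        rest = [(abs(x - p), lab)
                for j, (p, lab) in enumerate(zip(points, labels)) if j != i]
        predicted = min(rest, key=lambda t: t[0])[1]
        if predicted != y:
            noisy.append(i)
    return noisy

xs = [1.0, 1.1, 1.2, 5.0, 5.1, 1.3]
ys = ["a", "a", "a", "b", "b", "b"]   # the last label looks mislabelled
print(predictive_filter(xs, ys))       # [5]
```

Filtering and polish would go one step further and overwrite the flagged labels with the predictions instead of discarding the instances.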
2

Li, Lin. "Data quality and data cleaning in database applications." Thesis, Edinburgh Napier University, 2012. http://researchrepository.napier.ac.uk/Output/5788.

Abstract:
Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many reasons for such failures, such as poor system infrastructure design or query performance, but nothing is more certain to yield failure than lack of concern for the issue of data quality. High-quality data is a key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial one. It has now been recognized that an inordinate proportion of data in most data sources is dirty. Obviously, a database application with a high proportion of dirty data is not reliable for the purpose of data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also unreliable. In order to ensure high data quality, enterprises need a process, methodologies, and resources to monitor and analyze the quality of data, as well as methodologies for preventing, detecting, and repairing dirty data. This thesis focuses on the improvement of data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to the improvement of the quality of data, and has addressed a number of research issues related to data cleaning. In the first part of the thesis, related literature on data cleaning and data quality is reviewed and discussed. Building on this research, a rule-based taxonomy of dirty data is proposed in the second part of the thesis.
The proposed taxonomy not only summarizes the most common dirty data types but is the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This helps us to design the DDS process in the proposed data cleaning framework described in the third part of the thesis. This framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process. Finally, a set of approximate string matching algorithms are studied and experimental work has been undertaken. Approximate string matching is an important part of many data cleaning approaches and has been well studied for many years. The experimental work in the thesis confirmed the statement that there is no clear best technique. It shows that the characteristics of data such as the size of a dataset, the error rate in a dataset, the type of strings in a dataset, and even the type of typo in a string will have a significant effect on the performance of the selected techniques. In addition, the characteristics of data also affect the selection of suitable threshold values for the selected matching algorithms. The achievements based on these experimental results provide the fundamental improvement in the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of data cleaning systems in database applications.
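Approximate string matching, the experimental focus of the final part, is commonly implemented as Levenshtein edit distance with a tunable similarity threshold; the thesis's point is precisely that the best technique and threshold depend on the data. A minimal sketch:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance
    (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_match(a, b, threshold=0.8):
    """Normalized similarity with a tunable threshold; the 0.8 default
    is an arbitrary illustration, not a value from the thesis."""
    m = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / m >= threshold

print(levenshtein("kitten", "sitting"))   # 3
print(is_match("Jonathan", "Jonathon"))   # True: similarity 0.875
```

Swapping in a different distance (Jaro-Winkler, q-grams) or threshold changes which pairs match, which is exactly the selection problem the framework's 'algorithm selection mechanism' addresses.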
3

Iyer, Vasanth. "Ensemble Stream Model for Data-Cleaning in Sensor Networks." FIU Digital Commons, 2013. http://digitalcommons.fiu.edu/etd/973.

Abstract:
Ensemble stream modeling and data cleaning are sensor information processing approaches with different training and testing methods by which their goals are cross-validated. This research examines a mechanism that seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events so as to eliminate uncorrelated noise and choose the most likely model without overfitting, thus obtaining higher model confidence. Higher-quality streams can be realized by combining many short streams into an ensemble that has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction for a bush or natural forest-fire event, we take the burnt area (BA*), sensed ground truth obtained from logs, as our target variable. Even though this is an obvious model choice, the results are disappointing, for two reasons: one, the histogram of fire activity is highly skewed; two, the measured sensor parameters are highly correlated. Since using non-descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory, and conceptual knowledge is learned from sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize training error. We use the F-measure score, which combines precision and recall, to determine the false alarm rate of fire events. The multi-target data-cleaning trees use information purity of the target leaf nodes to learn higher-order features. A sensitive variance measure such as the F-test is performed during each node's split to select the best attribute. The ensemble stream model approach proved to improve when using complicated features with a simpler tree classifier.
The ensemble framework for data cleaning and the enhancements to quantify quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of sensor streams led to the formation of streams for sensor-enabled applications, which further motivates the novelty of stream quality labeling and its importance in handling the vast amounts of real-time mobile streams generated today.
4

Jia, Xibei. "From relations to XML : cleaning, integrating and securing data." Thesis, University of Edinburgh, 2008. http://hdl.handle.net/1842/3161.

Abstract:
While relational databases are still the preferred approach for storing data, XML is emerging as the primary standard for representing and exchanging data. Consequently, it has been increasingly important to provide a uniform XML interface to various data sources— integration; and critical to protect sensitive and confidential information in XML data — access control. Moreover, it is preferable to first detect and repair the inconsistencies in the data to avoid the propagation of errors to other data processing steps. In response to these challenges, this thesis presents an integrated framework for cleaning, integrating and securing data. The framework contains three parts. First, the data cleaning sub-framework makes use of a new class of constraints specially designed for improving data quality, referred to as conditional functional dependencies (CFDs), to detect and remove inconsistencies in relational data. Both batch and incremental techniques are developed for detecting CFD violations by SQL efficiently and repairing them based on a cost model. The cleaned relational data, together with other non-XML data, is then converted to XML format by using widely deployed XML publishing facilities. Second, the data integration sub-framework uses a novel formalism, XML integration grammars (XIGs), to integrate multi-source XML data which is either native or published from traditional databases. XIGs automatically support conformance to a target DTD, and allow one to build a large, complex integration via composition of component XIGs. To efficiently materialize the integrated data, algorithms are developed for merging XML queries in XIGs and for scheduling them. Third, to protect sensitive information in the integrated XML data, the data security sub-framework allows users to access the data only through authorized views. 
User queries posed on these views need to be rewritten into equivalent queries on the underlying document to avoid the prohibitive cost of materializing and maintaining a large number of views. Two algorithms are proposed to support virtual XML views: a rewriting algorithm that characterizes the rewritten queries as a new form of automata and an evaluation algorithm to execute the automata-represented queries. They allow the security sub-framework to answer queries on views in linear time. Using both relational and XML technologies, this framework provides a uniform approach to clean, integrate and secure data. The algorithms and techniques in the framework have been implemented and the experimental study verifies their effectiveness and efficiency.
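The abstract's CFD-based detection by SQL can be illustrated with a small hypothetical table (the schema and the sample CFD are invented for this example): for tuples matching the pattern country = 'UK', the zip code must determine the city, and a GROUP BY query finds the violating groups:

```python
import sqlite3

# Hypothetical customer table. The CFD says: when country = 'UK',
# zip functionally determines city (a conditional functional dependency).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cust (name TEXT, country TEXT, zip TEXT, city TEXT)")
con.executemany("INSERT INTO cust VALUES (?,?,?,?)", [
    ("Ann", "UK", "EH1", "Edinburgh"),
    ("Bob", "UK", "EH1", "Glasgow"),      # violates the CFD
    ("Eve", "US", "10001", "New York"),   # outside the pattern: ignored
])

# Groups matching the pattern whose zip maps to more than one city.
violations = con.execute("""
    SELECT zip FROM cust
    WHERE country = 'UK'
    GROUP BY zip
    HAVING COUNT(DISTINCT city) > 1
""").fetchall()
print(violations)   # [('EH1',)]
```

The thesis's batch detection works in this spirit; its incremental variant and the cost-model-based repair are beyond this sketch.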
5

Kokkonen, H. (Henna). "Effects of data cleaning on machine learning model performance." Bachelor's thesis, University of Oulu, 2019. http://jultika.oulu.fi/Record/nbnfioulu-201911133081.

Abstract:
This thesis focuses on the preprocessing and challenges of a university student data set and on how different levels of data preprocessing affect the performance of a prediction model, both in general and in selected groups of interest. The data set comprises the students at the University of Oulu who were admitted to the Faculty of Information Technology and Electrical Engineering during the years 2006–2015. This data set was cleaned at three different levels, which resulted in three differently processed data sets: the first is the original data set with only basic cleaning, the second has been cleaned of the most obvious anomalies, and the third has been systematically cleaned of possible anomalies. Each of these data sets was used to build a Gradient Boosting Machine model that predicted the cumulative number of ECTS credits the students would achieve by the end of their second-year studies, based on their first-year studies and Matriculation Examination results. The effects of the cleaning on model performance were examined by comparing prediction accuracy and the information the models gave about the factors that might indicate slow ECTS accumulation. The results showed that prediction accuracy improved after each cleaning stage and that the influences of the features altered significantly, becoming more reasonable.
6

Bischof, Stefan, Benedikt Kämpgen, Andreas Harth, Axel Polleres, and Patrik Schneider. "Open City Data Pipeline." Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business, 2017. http://epub.wu.ac.at/5438/1/city%2Dqb%2Dpaper.pdf.

Abstract:
Statistical data about cities, regions, and countries is collected for various purposes and by various institutions. Yet, while access to high-quality and recent such data is crucial both for decision makers and for the public, all too often such collections of data remain isolated and not reusable, let alone properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and to republish this data in a reusable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular, extensible, always up-to-date fashion; (ii) we use both machine learning techniques and ontological reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV, and linking to e.g. DBpedia. Lastly, in an exhaustive evaluation of our approach, we compare our enrichment and cleansing techniques to a preliminary version of the Open City Data Pipeline presented at ISWC 2015: firstly, we demonstrate that the combination of equational knowledge and standard machine learning techniques significantly helps to improve the quality of our missing-value imputations; secondly, we arguably show that the more data we integrate, the more reliable our predictions become. Hence, over time, the Open City Data Pipeline shall provide a sustainable effort to serve Linked Data about cities in increasing quality.
Series: Working Papers on Information Systems, Information Business and Operations
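As a rough, hedged sketch of the imputation-plus-per-indicator-assessment idea (the pipeline itself uses machine learning and ontological reasoning; plain mean imputation and an imputed-fraction score are simplified stand-ins):

```python
def impute_and_score(column):
    """Mean-impute missing entries of one indicator column and report
    the fraction that was imputed, a crude proxy for the per-indicator
    accuracy assessment described in the paper."""
    known = [v for v in column if v is not None]
    fill = sum(known) / len(known)          # assumes at least one known value
    imputed = [v if v is not None else fill for v in column]
    return imputed, (len(column) - len(known)) / len(column)

col = [3.0, None, 5.0, 4.0, None]
print(impute_and_score(col))   # fill value 4.0; 40% of entries imputed
```

Reporting how much of each indicator is imputed lets downstream consumers weight indicators by trustworthiness, which is the design motivation behind point (iii) above.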
7

Pumpichet, Sitthapon. "Novel Online Data Cleaning Protocols for Data Streams in Trajectory, Wireless Sensor Networks." FIU Digital Commons, 2013. http://digitalcommons.fiu.edu/etd/1004.

Abstract:
The promise of Wireless Sensor Networks (WSNs) is the autonomous collaboration of a collection of sensors to accomplish some specific goals which a single sensor cannot offer. Basically, sensor networking serves a range of applications by providing the raw data as fundamentals for further analyses and actions. The imprecision of the collected data could tremendously mislead the decision-making process of sensor-based applications, resulting in an ineffectiveness or failure of the application objectives. Due to inherent WSN characteristics normally spoiling the raw sensor readings, many research efforts attempt to improve the accuracy of the corrupted or “dirty” sensor data. The dirty data need to be cleaned or corrected. However, the developed data cleaning solutions restrict themselves to the scope of static WSNs where deployed sensors would rarely move during the operation. Nowadays, many emerging applications relying on WSNs need the sensor mobility to enhance the application efficiency and usage flexibility. The location of deployed sensors needs to be dynamic. Also, each sensor would independently function and contribute its resources. Sensors equipped with vehicles for monitoring the traffic condition could be depicted as one of the prospective examples. The sensor mobility causes a transient in network topology and correlation among sensor streams. Based on static relationships among sensors, the existing methods for cleaning sensor data in static WSNs are invalid in such mobile scenarios. Therefore, a solution of data cleaning that considers the sensor movements is actively needed. This dissertation aims to improve the quality of sensor data by considering the consequences of various trajectory relationships of autonomous mobile sensors in the system. First of all, we address the dynamic network topology due to sensor mobility. 
The concept of virtual sensor is presented and used for spatio-temporal selection of neighboring sensors to help in cleaning sensor data streams. This method is one of the first methods to clean data in mobile sensor environments. We also study the mobility pattern of moving sensors relative to boundaries of sub-areas of interest. We developed a belief-based analysis to determine the reliable sets of neighboring sensors to improve the cleaning performance, especially when node density is relatively low. Finally, we design a novel sketch-based technique to clean data from internal sensors where spatio-temporal relationships among sensors cannot lead to the data correlations among sensor streams.
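The virtual-sensor cleaning idea, correcting a reading against spatio-temporally selected neighbors, might look like the following simplification (the median rule and deviation threshold are assumptions for illustration, not the dissertation's protocol):

```python
from statistics import median

def clean_reading(reading, neighbor_readings, max_dev=5.0):
    """Replace a reading with the median of its (virtual-sensor-selected)
    neighbors when it deviates from them by more than max_dev;
    otherwise keep the original reading."""
    if not neighbor_readings:
        return reading          # no neighbors available: nothing to compare
    ref = median(neighbor_readings)
    return ref if abs(reading - ref) > max_dev else reading

print(clean_reading(98.0, [21.0, 22.5, 21.8]))   # corrected to 21.8
print(clean_reading(22.0, [21.0, 22.5, 21.8]))   # plausible: kept as 22.0
```

In the mobile setting the hard part, which this sketch hides, is choosing `neighbor_readings`: the dissertation's virtual sensors, belief analysis, and sketches exist precisely because the neighbor set changes as sensors move.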
8

Artilheiro, Fernando Manuel Freitas. "Analysis and procedures of multibeam data cleaning for bathymetric charting." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1996. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp04/mq23776.pdf.

9

Ramakrishnan, Ranjani. "A data cleaning and annotation framework for genome-wide studies." 2007. http://content.ohsu.edu/u?/etd,263.

10

Hallström, Fredrik, and David Adolfsson. "Data Cleaning Extension on IoT Gateway : An Extended ThingsBoard Gateway." Thesis, Karlstads universitet, Institutionen för matematik och datavetenskap (from 2013), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-84376.

Abstract:
Machine learning algorithms that run on Internet of Things sensor data require high data quality to produce relevant output. By providing data cleaning at the edge, cloud infrastructures performing AI computations are relieved of preprocessing. The main problem connected with edge cleaning is the dependency on unsupervised preprocessing, as it leaves no guarantee of high-quality output data. In this thesis, an IoT gateway is extended to provide cleaning and live configuration of cleaning parameters before forwarding the data to a server cluster. Live configuration is implemented so that the parameters can be fitted to match a time series and thereby mitigate quality issues. The gateway framework's performance and the container's resource usage were benchmarked using an MQTT stress tester. The gateway's performance was below expectation: with high-frequency data streams, the throughput was below 50%. However, these issues are not present for its Glava Energy Center connector, as that sensor data is generated at a slower pace.
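A minimal sketch of edge cleaning with live-configurable parameters of the kind the thesis describes (the range limits, window size, and class names are invented; MQTT transport and the ThingsBoard gateway itself are abstracted away):

```python
from collections import deque

class EdgeCleaner:
    """Range check plus moving-average smoothing, with limits that can
    be reconfigured at runtime, as in the extended gateway."""

    def __init__(self, lo=-40.0, hi=85.0, window=3):
        self.lo, self.hi = lo, hi
        self.buf = deque(maxlen=window)

    def configure(self, lo=None, hi=None):
        """Live reconfiguration of the valid range."""
        if lo is not None:
            self.lo = lo
        if hi is not None:
            self.hi = hi

    def push(self, value):
        """Return a smoothed value, or None if the reading is dropped."""
        if not (self.lo <= value <= self.hi):
            return None                      # out of range: drop at the edge
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)

c = EdgeCleaner()
print([c.push(v) for v in (20.0, 22.0, 300.0, 21.0)])
# the 300.0 spike is dropped; the rest are smoothed
```

Cleaning here, before forwarding to the server cluster, is what spares the cloud side from preprocessing, at the cost of the unsupervised-parameter risk the thesis discusses.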

Books on the topic "Data cleaning"

1

Ganti, Venkatesh, and Anish Das Sarma. Data Cleaning. Cham: Springer International Publishing, 2013. http://dx.doi.org/10.1007/978-3-031-01897-8.

2

Dasu, Tamraparni, and Theodore Johnson. Exploratory data mining and data cleaning. Hoboken, NJ: John Wiley & Sons, 2004.

3

Osborne, Jason W. Best practices in data cleaning. Thousand Oaks: SAGE, 2013.

4

SAS Institute, ed. Cody's data cleaning techniques using SAS. 2nd ed. Cary, NC: SAS Institute Inc., 2008.

5

de Jonge, Edwin, and Mark van der Loo. Statistical Data Cleaning with Applications in R. Chichester, UK: John Wiley & Sons, Ltd, 2018. http://dx.doi.org/10.1002/9781118897126.

6

SAS Institute, ed. Cody's data cleaning techniques using SAS software. Cary, NC: SAS Institute Inc., 1999.

7

Buttrey, Samuel. A Data Scientist's Guide to Acquiring, Cleaning and Managing Data in R. Chichester, UK: John Wiley & Sons Ltd, 2017. http://dx.doi.org/10.1002/9781119080053.

8

United States. Environmental Protection Agency. Office of Water Regulations and Standards, ed. Preliminary data summary for the transportation equipment cleaning industry. Washington, D.C: Office of Water Regulations and Standards, Office of Water, U.S. Environmental Protection Agency, 1993.

9

Gibbs, Roger. A review of the data available on cleaning services. [London?: Department of Trade and Industry?], 1987.

10

Kimball, Ralph. The data warehouse ETL toolkit: Practical techniques for extracting, cleaning, conforming, and delivering data. Indianapolis, IN: Wiley, 2004.


Book chapters on the topic "Data cleaning"

1

Sunne, Samantha. "Cleaning Data." In Data + Journalism, 71–85. New York: Routledge, 2022. http://dx.doi.org/10.4324/9781003273301-5.

2

Ganti, Venkatesh, and Anish Das Sarma. "Task: Deduplication." In Data Cleaning, 49–55. Cham: Springer International Publishing, 2013. http://dx.doi.org/10.1007/978-3-031-01897-8_8.

3

Ganti, Venkatesh, and Anish Das Sarma. "Operator: Clustering." In Data Cleaning, 29–34. Cham: Springer International Publishing, 2013. http://dx.doi.org/10.1007/978-3-031-01897-8_5.

4

Ganti, Venkatesh, and Anish Das Sarma. "Similarity Functions." In Data Cleaning, 13–16. Cham: Springer International Publishing, 2013. http://dx.doi.org/10.1007/978-3-031-01897-8_3.

5

Ganti, Venkatesh, and Anish Das Sarma. "Introduction." In Data Cleaning, 1–6. Cham: Springer International Publishing, 2013. http://dx.doi.org/10.1007/978-3-031-01897-8_1.

6

Ganti, Venkatesh. "Data Cleaning." In Encyclopedia of Database Systems, 1–4. New York, NY: Springer New York, 2016. http://dx.doi.org/10.1007/978-1-4899-7993-3_592-2.

7

Ganti, Venkatesh. "Data Cleaning." In Encyclopedia of Database Systems, 561–64. Boston, MA: Springer US, 2009. http://dx.doi.org/10.1007/978-0-387-39940-9_592.

8

Whitmore, Nathan. "Data cleaning." In R for Conservation and Development Projects, 125–44. Chapman & Hall/CRC The R Series. Boca Raton: Chapman and Hall/CRC, 2020. http://dx.doi.org/10.1201/9780429262180-ch10.

9

Winson-Geideman, Kimberly, Andy Krause, Clifford A. Lipscomb, and Nicholas Evangelopoulos. "Data cleaning." In Real Estate Analysis in the Information Age, 86–100. Abingdon, Oxon: Routledge, 2017. http://dx.doi.org/10.4324/9781315311135-9.

10

Chu, Xu. "Data Cleaning." In Encyclopedia of Big Data Technologies, 1–7. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-63962-8_3-1.


Conference papers on the topic "Data cleaning"

1

Chu, Xu, Ihab F. Ilyas, Sanjay Krishnan, and Jiannan Wang. "Data Cleaning." In SIGMOD/PODS'16: International Conference on Management of Data. New York, NY, USA: ACM, 2016. http://dx.doi.org/10.1145/2882903.2912574.

2

Alipour-Langouri, Morteza, Zheng Zheng, Fei Chiang, Lukasz Golab, and Jaroslaw Szlichta. "Contextual Data Cleaning." In 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW). IEEE, 2018. http://dx.doi.org/10.1109/icdew.2018.00010.

3

Tang, Jie, Hang Li, Yunbo Cao, and Zhaohui Tang. "Email data cleaning." In Proceedings of the eleventh ACM SIGKDD international conference. New York, New York, USA: ACM Press, 2005. http://dx.doi.org/10.1145/1081870.1081926.

4

Volkovs, Maksims, Fei Chiang, Jaroslaw Szlichta, and Renee J. Miller. "Continuous data cleaning." In 2014 IEEE 30th International Conference on Data Engineering (ICDE). IEEE, 2014. http://dx.doi.org/10.1109/icde.2014.6816655.

5

Zhang, Aoqian, Shaoxu Song, and Jianmin Wang. "Sequential Data Cleaning." In SIGMOD/PODS'16: International Conference on Management of Data. New York, NY, USA: ACM, 2016. http://dx.doi.org/10.1145/2882903.2915233.

6

Johnson, Theodore, and Tamraparni Dasu. "Data quality and data cleaning." In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. New York, New York, USA: ACM Press, 2003. http://dx.doi.org/10.1145/872757.872875.

7

Parulian, Nikolaus N., and Bertram Ludascher. "Towards Transparent Data Cleaning: The Data Cleaning Model Explorer (DCM/X)." In 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 2021. http://dx.doi.org/10.1109/jcdl52503.2021.00054.

8

Tang, Nan. "Big RDF data cleaning." In 2015 31st IEEE International Conference on Data Engineering Workshops (ICDEW). IEEE, 2015. http://dx.doi.org/10.1109/icdew.2015.7129549.

9

Musleh, Mashaal, Mourad Ouzzani, Nan Tang, and AnHai Doan. "CoClean: Collaborative Data Cleaning." In SIGMOD/PODS '20: International Conference on Management of Data. New York, NY, USA: ACM, 2020. http://dx.doi.org/10.1145/3318464.3384698.

10

Hua, Ming, and Jian Pei. "Cleaning disguised missing data." In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, New York, USA: ACM Press, 2007. http://dx.doi.org/10.1145/1281192.1281294.


Reports on the topic "Data cleaning"

1

International Food Policy Research Institute (IFPRI). A guide to data cleaning using Stata. Washington, DC: International Food Policy Research Institute, 2018. http://dx.doi.org/10.2499/1024320680.

2

Bollinger, Christopher, and Amitabh Chandra. Iatrogenic Specification Error: A Cautionary Tale of Cleaning Data. Cambridge, MA: National Bureau of Economic Research, March 2003. http://dx.doi.org/10.3386/t0289.

3

Marinshaw, Richard J., and Hazem Qawasmeh. Characterizing Water Use at Mosques in Abu Dhabi. RTI Press, April 2020. http://dx.doi.org/10.3768/rtipress.2020.mr.0042.2004.

Abstract:
In areas where Muslims constitute much of the population, mosques can account for a significant portion of overall water consumption. Among the various uses of water at mosques, ablution (i.e., ritual cleansing) is generally assumed to be the largest, by far. As part of an initiative to reduce water consumption at mosques in Abu Dhabi, we collected data on ablution and other end uses for water from hundreds of mosques in and around Abu Dhabi City. This paper takes a closer look at how water is used at mosques in Abu Dhabi and presents a set of water use profiles that provide a breakdown of mosque water consumption by end use. The results of this research indicate that cleaning the mosque (primarily the floors) and some of the other non-ablution end uses at mosques can account for a significant portion of the total water consumption and significantly more than was anticipated or has been found in other countries.
4

Rodríguez, Francisco. Cleaning Up the Kitchen Sink: On the Consequences of the Linearity Assumption for Cross-Country Growth Empirics. Inter-American Development Bank, January 2006. http://dx.doi.org/10.18235/0011322.

Abstract:
Existing work in growth empirics either assumes linearity of the growth function or attempts to capture non-linearities by the addition of a small number of quadratic or multiplicative interaction terms. Under a more generalized failure of linearity or if the functional form taken by the non-linearity is not known ex ante, such an approach is inadequate and will lead to biased and inconsistent OLS and instrumental variables estimators. This paper uses non-parametric and semiparametric methods of estimation to evaluate the relevance of strong non-linearities in commonly used growth data sets. Our tests decisively reject the linearity hypothesis. A preponderance of our tests also rejects the hypothesis that growth is a separable function of its regressors. Absent separability, the approximation error of estimators of the growth function grows in proportion to the number of relevant dimensions, substantially increasing the data requirements necessary to make inferences about the growth effects of regressors. We show that appropriate non-parametric tests are commonly inconclusive as to the effects of policies, institutions and economic structure on growth.
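The non-parametric estimation the abstract invokes can be illustrated with a Nadaraya-Watson kernel regression: a locally weighted mean that lets the data reveal non-linearities instead of imposing a linear growth function. This is a toy sketch on made-up data; the paper's actual estimators, bandwidths, and growth regressors are not reproduced here.

```python
import math

def kernel_regression(xs, ys, x0, h):
    """Nadaraya-Watson estimator: a locally weighted mean with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Hypothetical data with a curvature that a global straight line would miss.
xs = [i / 10 for i in range(-20, 21)]
ys = [x ** 2 for x in xs]

# The estimate adapts to the local shape of the data at each evaluation point.
print(kernel_regression(xs, ys, x0=0.0, h=0.3))   # near the minimum of the curve
print(kernel_regression(xs, ys, x0=1.5, h=0.3))   # tracks the rising branch
```

The price of this flexibility is exactly the curse of dimensionality the abstract describes: with more regressors, far more data is needed for the local averages to be informative.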
5

Gazit, Nadav, Katherine Hade, Suzanne Macey, and Stefanie Siller. Modeling Suitable Habitat for a Species of Conservation Concern: An Introduction to Spatial Analysis with QGIS. American Museum of Natural History, 2020. http://dx.doi.org/10.5531/cbc.ncep.0068.

Abstract:
Spatial analysis has become a central practice in the field of conservation, allowing scientists to model and explore geographic questions on biodiversity and ecological systems. GIS (Geographic Information System) is an important integrative tool for mapping, analyzing, and creating data for spatial analyses. In this exercise, students use QGIS, an open-source GIS program, to model suitable habitat for a cryptic mammal species. The exercise guides students through the process of: 1) organizing, cleaning, and clipping vector and raster data within QGIS; 2) analyzing climate, habitat, and additional geographic data along with species occurrence data; and 3) developing a map of suitable projected habitat for the species of interest. Students then apply their analyses to critically consider the implications for surveying and conservation action.
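Step 1 of the exercise — clipping occurrence records to a study area — amounts, conceptually, to the spatial filter below. This is a plain-Python illustration with hypothetical coordinates; in QGIS itself the clipping is done through its vector and raster tools rather than hand-written code.

```python
def clip_points(points, bbox):
    """Keep only the (lon, lat) records that fall inside a bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [(lon, lat) for lon, lat in points
            if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat]

# Hypothetical species-occurrence records and study-area extent.
occurrences = [(-73.9, 40.7), (-74.5, 41.2), (-80.0, 25.8)]
study_area = (-75.0, 40.0, -73.0, 42.0)  # (min_lon, min_lat, max_lon, max_lat)

clipped = clip_points(occurrences, study_area)
print(clipped)  # the record far outside the study area is removed
```

Real study areas are rarely rectangles, so GIS tools clip against arbitrary polygons, but the cleaning idea — discard data outside the region of analysis before modelling — is the same.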
6

Darling, Arthur H., Diego J. Rodríguez, and William J. Vaughan. Uncertainty in the Economic Appraisal of Water Quality Improvement Investments: The Case for Project Risk Analysis. Inter-American Development Bank, July 2000. http://dx.doi.org/10.18235/0008825.

Abstract:
This technical paper argues that Monte Carlo risk analysis offers a more comprehensive and informative way to look at project risk ex-ante than the traditional (and often arbitrary), one-influence-at-atime sensitivity analysis approach customarily used in IDB analyses of economic feasibility. The case for probabilistic risk analysis is made using data from a project for cleaning up the Tietê river in São Paulo, Brazil. A number of ways to handle uncertainty about benefits are proposed, and their implications for the project acceptance decision and the consequent degree of presumed project risk are explained and illustrated.
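The paper's argument for Monte Carlo risk analysis over one-influence-at-a-time sensitivity analysis can be made concrete with a minimal NPV simulation: every uncertain input is drawn jointly on each iteration, yielding a full distribution of project outcomes rather than a handful of single-variable perturbations. The cash-flow model and distributions below are hypothetical, not the Tietê project's.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def npv(benefit, cost, rate=0.10, years=20):
    """Net present value of a constant annual benefit against an upfront cost."""
    return sum(benefit / (1 + rate) ** t for t in range(1, years + 1)) - cost

# Monte Carlo: sample all uncertain inputs together on every iteration,
# rather than varying one input at a time around a base case.
draws = [npv(benefit=random.gauss(12.0, 3.0),   # annual benefit, $M (hypothetical)
             cost=random.gauss(80.0, 10.0))     # upfront cost, $M (hypothetical)
         for _ in range(10_000)]

prob_negative = sum(d < 0 for d in draws) / len(draws)
print(f"P(NPV < 0) = {prob_negative:.2f}")
```

A probability that the project destroys value is a summary no one-at-a-time sensitivity table can deliver, which is the core of the case the paper makes.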
7

Beal, Daniel. ESM Research: From Design and Analysis to Publication. Instats Inc., 2022. http://dx.doi.org/10.61700/cldz810mwahip469.

Abstract:
This seminar introduces the use of Experience Sampling Methods (intensive longitudinal methods, ecological momentary assessment, diary methods, ambulatory assessment) to examine organizational phenomena. The first set of topics include development of ESM designs and measures, challenges with publishing ESM studies (with a particular emphasis on organizational journals), and an overview of tools used for ESM data collection and cleaning in R. The second set of topics focus on issues of ESM data analysis, including basic within- and between-person descriptive statistics and multilevel omega reliability from a Multilevel SEM perspective, and then more advanced Dynamic SEM techniques for causal inference in Mplus. An official Instats certificate of completion is provided at the conclusion of the seminar. For European PhD students, the seminar offers 2 ECTS Equivalent point.
8

Jalkanen, Jukka-Pekka, Erik Fridell, Jaakko Kukkonen, Jana Moldanova, Leonidas Ntziachristos, Achilleas Grigoriadis, Maria Moustaka, et al. Environmental impacts of exhaust gas cleaning systems in the Baltic Sea, North Sea, and the Mediterranean Sea area. Finnish Meteorological Institute, 2024. http://dx.doi.org/10.35614/isbn.9789523361898.

Abstract:
Description: Shipping is responsible for a range of different pressures affecting air quality, climate, and the marine environment. Most social and economic analyses of shipping have focused on air pollution assessment and how shipping may impact climate change and human health. This risks that policies may be biased towards air pollution and climate change, whilst impacts on the marine environment are not as well known. One example is the sulfur regulation introduced in January 2020, which requires shipowners to use a compliant fuel with a sulfur content of 0.5% (0.1% in SECA regions) or use alternative compliance options (Exhaust Gas Cleaning Systems, EGCS) that are effective in reducing sulfur oxide (SOx) emissions to the atmosphere. The EGCS cleaning process results in large volumes of discharged water that includes a wide range of contaminants. Although regulations target SOx removal, other pollutants such as polycyclic aromatic hydrocarbons (PAHs), metals and combustion particles are removed from the exhaust to the wash water and subsequently discharged to the marine environment. Based on dilution series of the Whole Effluent Testing (WET), the impact of the EGCS effluent on marine invertebrate species and on phytoplankton was found to vary between taxonomic groups, and between different stages of the invertebrate life cycle. Invertebrates were more affected than phytoplankton, and the most sensitive endpoint detected in the present project was the fertilisation of sea urchin eggs, which were negatively affected at a sample dilution of 1 : 1,000,000. Dilutions of 1: 100,000 were harmful to early development of several of the tested species, including mussels, polychaetes, and crustaceans. The observed effects at these low concentrations of EGCS effluent were reduced egg production, and deformations and abnormal development of the larvae of the species. 
The ecotoxicological data produced in the EMERGE project were used to derive Predicted No Effect Concentration values. Corresponding modelling studies revealed that the EGCS effluent can be considered as a single entity for 2-10 days from the time of discharge, depending on the environmental conditions like sea currents, winds, and temperature. Area 10-30 km outside the shipping lanes will be prone to contaminant concentrations corresponding to 1 : 1,000,000 dilution which was deemed harmful for most sensitive endpoints of WET experiments. Studies for the Saronikos Gulf (Aegean Sea) revealed that the EGCS effluent dilution rate exceeded the 1 : 1,000,000 ratio 70% of the time at a distance of about 10 km from the port. This was also observed for 15% of the time within a band of 10 km wide along the shipping lane extending 500 km away from the port of Piraeus. When mortality of adult specimens of one of the species (copepod Acartia tonsa) was used as an endpoint it was found to be 3-4 orders of magnitude less sensitive to EGCS effluent than early life stage endpoints like fertilisation of eggs and larval development. Mortality of Acartia tonsa is commonly used in standard protocols for ecotoxicological studies, but our data hence shows that it seriously underestimates the ecologically relevant toxicity of the effluent. The same is true for two other commonly used and recommended endpoints, phytoplankton growth and inhibition of bioluminescence in marine bacteria. Significant toxic effects were reached only after addition of 20-40% effluent. A marine environmental risk assessment was performed for the Öresund region for baseline year 2018, where Predicted Environmental Concentrations (PECs) of open loop effluent discharge water were compared to the PNEC value. 
The results showed modelled concentrations of open loop effluent in large areas to be two to three orders of magnitude higher than the derived PNEC value, yielding a Risk Characterisation Ratio of 500-5000, which indicates significant environmental risk. Further, it should be noted that between 2018 and 2022 the number of EGCS vessels more than quadrupled in the area from 178 to 781. In this work, the EGCS discharges of the fleet in the Baltic Sea, North Sea, the English Channel, and the Mediterranean Sea area were studied in detail. The assessments of impacts described in this document were performed using a baseline year 2018 and future scenarios. These were made for the year 2050, based on different projections of transport volumes, also considering the fuel efficiency requirements and ship size developments. From the eight scenarios developed, two extremes were chosen for impact studies which illustrate the differences between a very high EGCS usage and a future without the need for EGCS while still compliant with the IMO initial GHG strategy. The scenario without EGCS leads to a 50% reduction of GHG emissions using low sulfur fuels, LNG, and methanol. For the high EGCS adoption scenario in 2050, about a third of the fleet sailing the studied sea areas would use EGCS and effluent discharge volumes would be increased tenfold for the Baltic Sea and hundredfold for the Mediterranean Sea when compared to 2018 baseline discharges. Some of the tested species, mainly the copepods, have a central position in pelagic food webs as they feed on phytoplankton and are themselves the main staple food for most fish larvae and for some species of adult fish, e.g., herring. The direct effect of the EGCS effluent on invertebrates will therefore have an important indirect effect on the fish feeding on them. Effects are greatest in and near shipping lanes. Many important shipping lanes run close to shore and archipelago areas, and this also puts the sensitive shallow water coastal ecosystems at risk.
It should be noted that no studies on sub-lethal effects of early life stages in fish were included in the EMERGE project, nor are there any available data on this in the scientific literature. The direct toxic effects on fish at the expected concentrations of EGCS effluent are therefore largely unknown. According to the regional modelling studies, some of the contaminants will end up in sediments along the coastlines and archipelagos. The documented complex chemical composition of EGCS effluent is in sharp contrast to the present legislation on threshold levels for content in EGCS effluent discharged from ships, which includes but a few PAHs, pH, and turbidity. Traditional assessments of PAHs in environmental and marine samples focus only on the U.S. Environmental Protection Agency (EPA) list of 16 priority PAHs, which includes only parent PAHs. Considering the complex PAH assemblages and the importance of other related compounds, it is important to extend the EPA list to include alkyl-PAHs to obtain representative monitoring of EGCS effluent and to assess the impact of its discharges into the marine environment. An economic evaluation of the installation and operational costs of EGCS was conducted noting the historical fuel price differences of high and low sulfur fuels. Equipment types, installation dates and annual fuel consumption from global simulations indicated that 51% of the global EGCS fleet had already reached break-even by the end of 2022, resulting in a summarised profit of 4.7 billion €2019. Within five years after the initial installation, more than 95% of the ships with open loop EGCS reach break-even. The pollutant loads from shipping come both through atmospheric deposition and direct discharges. This underlines the need to minimise the release of contaminants by using fuels which reduce the air emissions of harmful components without creating new pollution loads through discharges.
Continued use of EGCS and high sulfur fossil fuels will delay the transition to more sustainable options. The investments made on EGCS enable ships to continue using fossil fuels instead of transitioning away from them as soon as possible as agreed in the 2023 Dubai Climate Change conference. Continued carriage of residual fuels also increases the risk of dire environmental consequences whenever accidental releases of oil to the sea occur.
9

Martin, Mark, Lance Vowell, Ian King, and Chris Augustus. Automated Data Cleansing in Data Harvesting and Data Migration. Office of Scientific and Technical Information (OSTI), March 2011. http://dx.doi.org/10.2172/949761.

10

Adjaye-Gbewonyo, Dzifa, and Lindsey Back. Dental Care Utilization Among Children Aged 1–17 Years: United States, 2019 and 2020. National Center for Health Statistics (U.S.), December 2021. http://dx.doi.org/10.15620/cdc:111175.

Abstract:
This report uses data from the 2019 and 2020 National Health Interview Survey (NHIS) to describe recent changes in the prevalence of dental examinations or cleanings in the past 12 months among children aged 1–17 years by selected sociodemographic characteristics.