Dissertations / Theses on the topic 'Cleaning of data'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Cleaning of data.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Li, Lin. "Data quality and data cleaning in database applications." Thesis, Edinburgh Napier University, 2012. http://researchrepository.napier.ac.uk/Output/5788.
Liebchen, Gernot Armin. "Data cleaning techniques for software engineering data sets." Thesis, Brunel University, 2010. http://bura.brunel.ac.uk/handle/2438/5951.
Iyer, Vasanth. "Ensemble Stream Model for Data-Cleaning in Sensor Networks." FIU Digital Commons, 2013. http://digitalcommons.fiu.edu/etd/973.
Kokkonen, H. (Henna). "Effects of data cleaning on machine learning model performance." Bachelor's thesis, University of Oulu, 2019. http://jultika.oulu.fi/Record/nbnfioulu-201911133081.
Jia, Xibei. "From relations to XML : cleaning, integrating and securing data." Thesis, University of Edinburgh, 2008. http://hdl.handle.net/1842/3161.
Bischof, Stefan, Benedikt Kämpgen, Andreas Harth, Axel Polleres, and Patrik Schneider. "Open City Data Pipeline." Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business, 2017. http://epub.wu.ac.at/5438/1/city%2Dqb%2Dpaper.pdf.
Full textSeries: Working Papers on Information Systems, Information Business and Operations
Pumpichet, Sitthapon. "Novel Online Data Cleaning Protocols for Data Streams in Trajectory, Wireless Sensor Networks." FIU Digital Commons, 2013. http://digitalcommons.fiu.edu/etd/1004.
Artilheiro, Fernando Manuel Freitas. "Analysis and procedures of multibeam data cleaning for bathymetric charting." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1996. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp04/mq23776.pdf.
Ramakrishnan, Ranjani. "A data cleaning and annotation framework for genome-wide studies." Full text open access at:, 2007. http://content.ohsu.edu/u?/etd,263.
Hallström, Fredrik, and David Adolfsson. "Data Cleaning Extension on IoT Gateway : An Extended ThingsBoard Gateway." Thesis, Karlstads universitet, Institutionen för matematik och datavetenskap (from 2013), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-84376.
Full textAI4ENERGY
Bischof, Stefan, Andreas Harth, Benedikt Kämpgen, Axel Polleres, and Patrik Schneider. "Enriching integrated statistical open city data by combining equational knowledge and missing value imputation." Elsevier, 2017. http://dx.doi.org/10.1016/j.websem.2017.09.003.
Bakhtiar, Qutub A. "Mitigating Inconsistencies by Coupling Data Cleaning, Filtering, and Contextual Data Validation in Wireless Sensor Networks." FIU Digital Commons, 2009. http://digitalcommons.fiu.edu/etd/99.
Lew, Alexander K. "PClean : Bayesian data cleaning at scale with domain-specific probabilistic programming." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/130607.
Full textCataloged from the official PDF version of thesis.
Includes bibliographical references (pages 89-93).
Data cleaning is naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered, corrupted, and joined to yield incomplete, dirty, and denormalized datasets. Based on this view, this thesis presents PClean, a unified generative modeling architecture for cleaning and normalizing dirty data in diverse domains. Given an unclean dataset and a probabilistic program encoding relevant domain knowledge, PClean learns a structured representation of the data as a relational database of interrelated objects, and uses this latent structure to impute missing values, identify duplicates, detect errors, and propose corrections in the original data table. PClean makes three modeling and inference contributions: (i) a domain-general non-parametric generative model of relational data, for inferring latent objects and their network of latent connections; (ii) a domain-specific probabilistic programming language, for encoding domain knowledge specific to each dataset being cleaned; and (iii) a domain-general inference engine that adapts to each PClean program by constructing data-driven proposals used in sequential Monte Carlo and particle Gibbs. This thesis shows empirically that short (< 50-line) PClean programs deliver higher accuracy than state-of-the-art data cleaning systems based on machine learning and weighted logic; that PClean's inference algorithm is faster than generic particle Gibbs inference for probabilistic programs; and that PClean scales to large real-world datasets with millions of rows.
by Alexander K. Lew.
S.M.
S.M. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
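The PClean abstract above frames cleaning as posterior inference in a generative model: a prior over ground-truth values combined with a noisy-channel likelihood for how dirty observations arise. The following Python fragment is a toy illustration of that idea only; it is not PClean's modeling language or inference engine, and the domain, prior counts and error rate are invented for the example.

```python
# Toy illustration of "cleaning as Bayesian inference", not PClean's actual API.
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

# Prior over ground-truth values (hypothetical reference counts).
prior_counts = Counter({"Cambridge": 900, "Camberville": 30, "Cambria": 70})
total = sum(prior_counts.values())
prior = {v: c / total for v, c in prior_counts.items()}

def likelihood(observed: str, truth: str, p_err: float = 0.1) -> float:
    """Noisy-channel model: each edit is an independent corruption event."""
    return (p_err ** edit_distance(observed, truth)) * (1 - p_err)

def posterior(observed: str) -> dict:
    """Posterior over clean values by enumeration (Bayes' rule)."""
    scores = {v: prior[v] * likelihood(observed, v) for v in prior}
    z = sum(scores.values())
    return {v: round(s / z, 4) for v, s in scores.items()}

print(posterior("Cambrige"))  # posterior mass concentrates on "Cambridge"
```

In PClean itself the prior is a non-parametric relational model and inference relies on sequential Monte Carlo with data-driven proposals; enumeration as above only works because the toy domain is tiny.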
Carreira, Paulo J. F. "Mapper: An Efficient Data Transformation Operator." Doctoral thesis, Department of Informatics, University of Lisbon, 2008. http://hdl.handle.net/10451/14295.
Bourennani, Farid. "Integration of heterogeneous data types using self organizing maps." Thesis, UOIT, 2009. http://hdl.handle.net/10155/41.
Jardini, Toni [UNESP]. "Ambiente data cleaning: suporte extensível, semântico e automático para análise e transformação de dados." Universidade Estadual Paulista (UNESP), 2012. http://hdl.handle.net/11449/98702.
Full textUm dos grandes desa os e di culdades para se obter conhecimento de fontes de dados e garantir consistência e a não duplicidade das informações armazenadas. Diversas técnicas e algoritmos têm sido propostos para minimizar o custoso trabalho de permitir que os dados sejam analisados e corrigidos. Porém, ainda há outras vertentes essenciais para se obter sucesso no processo de limpeza de dados, e envolvem diversas areas tecnológicas: desempenho computacional, semântica e autonomia do processo. Diante desse cenário, foi desenvolvido um ambiente data cleaningque contempla uma coleção de ferramentas de suporte a análise e transformação de dados de forma automática, extensível, com suporte semântico e aprendizado, independente de idioma. O objetivo deste trabalho e propor um ambiente cujas contribuições cobrem problemas ainda pouco explorados pela comunidade científica area de limpeza de dados como semântica e autonomia na execução da limpeza e possui, dentre seus objetivos, diminuir a interação do usuário no processo de análise e correção de inconsistências e duplicidades. Dentre as contribuições do ambiente desenvolvido, a eficácia se mostras significativa, cobrindo aproximadamente 90% do total de inconsistências presentes na base de dados, com percentual de casos de falsos-positivos 0% sem necessidade da interação do usuário
One of the great challenges and di culties to obtain knowledge from data sources is to ensure consistency and non-duplication of stored data. Many techniques and algorithms have been proposed to minimize the hard work to allow data to be analyzed and corrected. However, there are still other essential aspects for the data cleaning process success which involve many technological areas: performance, semantic and process autonomy. Against this backdrop, an data cleaning environment has been developed which includes a collec-tion of tools for automatic data analysis and processing, extensible, with multi-language semantic and learning support. The objective of this work is to propose an environment whose contributions cover problems yet explored by data cleaning scienti c community as semantic and autonomy in data cleaning process and it has, among its objectives, to re-duce user interaction in the process of analyzing and correcting data inconsistencies and duplications. Among the contributions of the developed environment, e ciency is signi -cant exhibitions, covering approximately 90% of database inconsistencies, with the 0% of false positives cases without the user interaction need
Jardini, Toni. "Ambiente data cleaning : suporte extensível, semântico e automático para análise e transformação de dados /." São José do Rio Preto : [s.n.], 2012. http://hdl.handle.net/11449/98702.
Committee: Nalvo Franco de Almeida Junior
Committee: José Márcio Machado
Abstract: One of the great challenges in obtaining knowledge from data sources is ensuring the consistency and non-duplication of the stored information. Many techniques and algorithms have been proposed to reduce the costly work of allowing data to be analysed and corrected. However, other aspects remain essential to a successful data cleaning process and involve several technological areas: computational performance, semantics and process autonomy. Against this backdrop, a data cleaning environment was developed that comprises a collection of tools for automatic, extensible data analysis and transformation, with language-independent semantic and learning support. The objective of this work is to propose an environment whose contributions address problems still little explored by the data cleaning research community, such as semantics and autonomy in the execution of the cleaning, and whose goals include reducing user interaction in the process of analysing and correcting inconsistencies and duplicates. Among the contributions of the developed environment, its effectiveness proved significant, covering approximately 90% of the inconsistencies present in the database with 0% false positives and no need for user interaction.
Master's degree
Neelisetty, Srikanth. "Detector Diagnostics, Data Cleaning and Improved Single Loop Velocity Estimation from Conventional Loop Detectors." The Ohio State University, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=osu1419350524.
Zelený, Pavel. "Řízení kvality dat v malých a středních firmách." Master's thesis, Vysoká škola ekonomická v Praze, 2010. http://www.nusl.cz/ntk/nusl-82036.
Cenonfolo, Filippo. "Signal cleaning techniques and anomaly detection algorithms for motorbike applications." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.
Feng, Yuan. "Improve Data Quality By Using Dependencies And Regular Expressions." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-35620.
Belen, Rahime. "Detecting Disguised Missing Data." Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/12610411/index.pdf.
Mahdavi Lahijani, Mohammad [Verfasser], Ziawasch [Akademischer Betreuer] Abedjan, Ziawasch [Gutachter] Abedjan, Wolfgang [Gutachter] Lehner, and Eugene [Gutachter] Wu. "Semi-supervised data cleaning / Mohammad Mahdavi Lahijani ; Gutachter: Ziawasch Abedjan, Wolfgang Lehner, Eugene Wu ; Betreuer: Ziawasch Abedjan." Berlin : Technische Universität Berlin, 2020. http://d-nb.info/1223023060/34.
Pecorella, Tommaso. "Progettazione ed implementazione di un data warehouse di supporto alla profilazione dei consumi energetici domestici." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amslaurea.unibo.it/8355/.
Tian, Yongchao. "Accéler la préparation des données pour l'analyse du big data." Thesis, Paris, ENST, 2017. http://www.theses.fr/2017ENST0017/document.
We are living in a big data world, where data is being generated in high volume, high velocity and high variety. Big data brings enormous value and benefits, so data analytics has become a critically important driver of business success across all sectors. However, if the data is not analyzed fast enough, the benefits of big data will be limited or even lost. Despite the existence of many modern large-scale data analysis systems, data preparation, which is the most time-consuming process in data analytics, has not yet received sufficient attention. In this thesis, we study the problem of how to accelerate data preparation for big data analytics. In particular, we focus on two major data preparation steps, data loading and data cleaning. As the first contribution of this thesis, we design DiNoDB, a SQL-on-Hadoop system which achieves interactive-speed query execution without requiring data loading. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data generated by those batch jobs. Existing solutions largely ignore the synergy between these two aspects, requiring the entire temporary dataset to be loaded to achieve interactive queries. In contrast, DiNoDB avoids the expensive data loading and transformation phase. The key innovation of DiNoDB is to piggyback the creation of metadata on the batch processing phase, which DiNoDB exploits to expedite the interactive queries. The second contribution is a distributed stream data cleaning system, called Bleach. Existing scalable data cleaning approaches rely on batch processing to improve data quality and are very time-consuming in nature. We target stream data cleaning, in which data is cleaned incrementally in real time. Bleach is the first qualitative stream data cleaning system, achieving both real-time violation detection and data repair on a dirty data stream. It relies on efficient, compact and distributed data structures to maintain the state necessary to clean data, and also supports rule dynamics. In our experimental evaluations, the two resulting systems, DiNoDB and Bleach, both achieve excellent performance compared to state-of-the-art approaches and can help data scientists significantly reduce the time they spend on data preparation.
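Bleach, as described above, detects violations and repairs records incrementally on a data stream rather than in batch. The following is a minimal sketch of that pattern under simplifying assumptions: a single functional-dependency-style rule and plain in-memory state (Bleach's distributed, compact data structures and rule dynamics are not reproduced here).

```python
# Minimal sketch (not Bleach's implementation) of incremental, rule-based stream
# cleaning: per-rule state lets each arriving record be checked for violations
# without re-scanning the whole stream.
from collections import defaultdict

class StreamFDChecker:
    """Checks a functional dependency lhs -> rhs incrementally."""
    def __init__(self, lhs: str, rhs: str):
        self.lhs, self.rhs = lhs, rhs
        self.seen = {}                       # lhs value -> first rhs value observed
        self.violations = defaultdict(list)  # lhs value -> conflicting records

    def process(self, record: dict):
        key, val = record[self.lhs], record[self.rhs]
        if key not in self.seen:
            self.seen[key] = val             # first occurrence: remember the value
        elif self.seen[key] != val:
            self.violations[key].append(record)  # real-time violation detection

checker = StreamFDChecker(lhs="zip", rhs="city")
stream = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambrige"},    # violates zip -> city
]
for rec in stream:
    checker.process(rec)
print(dict(checker.violations))
```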
Ortona, Stefano. "Easing information extraction on the web through automated rules discovery." Thesis, University of Oxford, 2016. https://ora.ox.ac.uk/objects/uuid:a5a7a070-338a-4afc-8be5-a38b486cf526.
Nunes, Marcos Freitas. "Avaliação experimental de uma técnica de padronização de escores de similaridade." Biblioteca Digital de Teses e Dissertações da UFRGS, 2009. http://hdl.handle.net/10183/25494.
With the growth of the Web, the volume of information has grown considerably over the past years and, consequently, access to remote databases has become easier, which allows the integration of distributed information. Usually, instances of the same real-world object originating from distinct databases present differences in the representation of their values, which means that the same information can be represented in different ways. In this context, research on approximate matching using similarity functions arises. As a consequence, there is a need to understand the results of these functions and to select suitable thresholds. Also, when matching records, there is the problem of combining the similarity scores, since distinct functions have different distributions. To overcome this problem, a previous work developed a technique that standardizes the scores by replacing the computed score with an adjusted score (obtained through training), which is more intuitive for the user and can be combined in the record matching process. This work was developed by a PhD student from the UFRGS database research group and is referred to as MeaningScore (DORNELES et al., 2007). The present work studies and performs an experimental evaluation of this technique. As the validation shows, the MeaningScore approach is valid and returns better results. In the record matching process, where distinct similarity scores must be combined, the usage of the adjusted score produces results of higher quality.
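The abstract above motivates replacing raw similarity scores with adjusted scores learned from training data so that scores produced by different functions become comparable and combinable. The sketch below only illustrates that general idea; it is not the MeaningScore algorithm itself, and the training pairs are hypothetical.

```python
# Hedged sketch of score standardization: replace a raw similarity score with a
# calibrated value learned from labeled training pairs, so that scores from
# different similarity functions live on a common scale.
import bisect

def calibrate(training):
    """training: list of (raw_score, is_match). Returns a raw->adjusted mapping."""
    training = sorted(training)
    scores = [s for s, _ in training]
    # adjusted score = fraction of true matches among training pairs with raw score >= s
    suffix_matches, total = [], 0
    for _, is_match in reversed(training):
        total += is_match
        suffix_matches.append(total)
    suffix_matches.reverse()

    def adjusted(raw):
        i = bisect.bisect_left(scores, raw)
        if i >= len(scores):
            return 1.0
        return suffix_matches[i] / (len(scores) - i)
    return adjusted

# Hypothetical training data for two similarity functions with different ranges.
adj_jaro = calibrate([(0.95, 1), (0.90, 1), (0.85, 0), (0.70, 0)])
adj_ngram = calibrate([(0.60, 1), (0.55, 1), (0.40, 0), (0.20, 0)])

# After calibration both functions are on the same "probability of match" scale
# and can be combined, e.g. by averaging.
print((adj_jaro(0.92) + adj_ngram(0.58)) / 2)
```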
Blackmore, Caitlin E. "The Effectiveness of Warnings at Reducing the Prevalence of Insufficient Effort Responding." Wright State University / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=wright1412080619.
Boskovitz, Agnes. "Data Editing and Logic: The covering set method from the perspective of logic." The Australian National University. Research School of Information Sciences and Engineering, 2008. http://thesis.anu.edu.au./public/adt-ANU20080314.163155.
Lamer, Antoine. "Contribution à la prévention des risques liés à l’anesthésie par la valorisation des informations hospitalières au sein d’un entrepôt de données." Thesis, Lille 2, 2015. http://www.theses.fr/2015LIL2S021/document.
Introduction: Hospital Information Systems (HIS) manage and register every day millions of data items related to patient care: biological results, vital signs, drug administrations, care processes... These data are stored by operational applications that provide remote access and a comprehensive picture of the Electronic Health Record. These data may also be reused for other purposes, such as clinical research or public health, particularly when integrated into a data warehouse. Some studies have highlighted a statistical link between compliance with quality indicators related to the anesthesia procedure and patient outcome during the hospital stay. In the University Hospital of Lille, these quality indicators, as well as patient comorbidities during the post-operative period, can be assessed with data collected by applications of the HIS. The main objective of this work is to integrate the data collected by operational applications in order to carry out clinical research studies. Methods: First, the quality of the data registered by the operational applications is evaluated with methods … by the literature or developed in this work. Then, the data quality problems highlighted by the evaluation are managed during the integration step of the ETL process. New data are computed and aggregated in order to provide indicators of quality of care. Finally, two studies demonstrate the usability of the system. Results: Pertinent data from the HIS have been integrated into an anesthesia data warehouse. This system has stored data about hospital stays and interventions (drug administrations, vital signs …) since 2010. Aggregated data have been developed and used in two clinical research studies. The first study highlighted a statistical link between induction and patient outcome. The second study evaluated compliance with quality indicators of ventilation and their impact on comorbidity. Discussion: The data warehouse and the cleaning and integration methods developed as part of this work allow statistical analysis to be performed on more than 200,000 interventions. This system can be implemented with other applications used in the CHRU of Lille but also with Anesthesia Information Management Systems used by other hospitals.
Pabarškaitė, Židrina. "Enhancements of pre-processing, analysis and presentation techniques in web log mining." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2009. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2009~D_20090713_142203-05841.
As the Internet penetrates our lives, ever more attention is paid to the quality of information delivery and to how information is presented. The research area of the dissertation is the mining of data accumulated by web servers and ways of improving how data is presented to the end user. The knowledge required for this is extracted from web server log records, which register information about the web pages sent to users. The object of the research is web log mining, together with related topics: improving the web data preparation stages, web text analysis, and data analysis algorithms for prediction and classification tasks. The main goal of the dissertation is to understand the behaviour patterns of website users by studying web log records and to improve the methodologies of the preparation, analysis and result interpretation stages. The research revealed new possibilities for web data analysis. It was found that insufficient attention had been paid to the cleaning of Internet data, i.e. web log records. It is shown that reducing the number of insignificant records makes the data analysis process more efficient. A new method was therefore developed whose application makes the presented knowledge correspond to users' real navigation routes. The study also established that users' browsing histories have different lengths, so after specific data preparation, forming fixed-length vectors, it is appropriate to apply decision tree algorithms that have so far not been used in practice... [see the full text for more]
Pinha, André Teixeira. "Monitoramento de doadores de sangue através de integração de bases de texto heterogêneas." Repositório Institucional da UFABC, 2016.
Find full textDissertação (mestrado) - Universidade Federal do ABC, Programa de Pós-Graduação em Ciência da Computação, 2016.
Through probabilistic record linkage of databases it is possible to obtain information that individual or manual analysis of the databases would not provide. This work aims to find, through probabilistic record linkage, blood donors from the database of the Fundação Pró-Sangue (FPS) in the Sistema de Informações sobre Mortalidade (SIM) for the years 2001 to 2006, thereby supporting the institution's management of blood products by inferring whether a given donor has died. For this purpose, the effectiveness of different blocking keys was evaluated in a set of free record linkage software packages and in a tool implemented specifically for this study, called SortedLink. In the experiments, the records were standardized and only those with the mother's name registered were used. To assess the effectiveness of the blocking keys, 100,000 records were randomly selected from the SIM and FPS databases, and 30 validation records were added to each set. Since SortedLink, the software implemented in this work, showed the best results, it was used to obtain the candidate record pairs over the full databases: 1,709,819 records from SIM and 334,077 from FPS. In addition, the study also evaluated the effectiveness of the SOUNDEX phonetic encoding algorithm, typically used in record linkage, and of BRSOUND, developed for encoding names and surnames from Brazilian Portuguese.
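The study above relies on blocking keys and phonetic encoding to make probabilistic record linkage tractable. As a generic illustration (neither SortedLink nor BRSOUND is reproduced, and the example names are invented), the sketch below groups records by an American Soundex blocking key so that only records sharing a key are compared in detail.

```python
# Generic blocking sketch for record linkage: group records by a cheap key
# (here the American Soundex code of the name) and generate candidate pairs
# only within blocks, drastically reducing the number of detailed comparisons.
from collections import defaultdict
from itertools import combinations

def soundex(name: str) -> str:
    """Classic American Soundex code, e.g. 'Robert' -> 'R163'."""
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (out + "000")[:4]

def candidate_pairs(records, key_field="name"):
    """Group record ids by blocking key; compare only within blocks."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[soundex(rec[key_field])].append(rid)
    return [pair for ids in blocks.values() for pair in combinations(ids, 2)]

records = {1: {"name": "Silva"}, 2: {"name": "Sylva"}, 3: {"name": "Pereira"}}
print(candidate_pairs(records))   # only (1, 2) is generated as a candidate pair
```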
Andrade, Tiago Luís de [UNESP]. "Ambiente independente de idioma para suporte a identificação de tuplas duplicadas por meio da similaridade fonética e numérica: otimização de algoritmo baseado em multithreading." Universidade Estadual Paulista (UNESP), 2011. http://hdl.handle.net/11449/98678.
In order to ensure greater reliability and consistency of the data stored in a database, the data cleaning stage is placed at the beginning of the Knowledge Discovery in Databases (KDD) process. This step is highly relevant because it eliminates problems that strongly affect the reliability of the extracted knowledge, such as missing values, null values, duplicate tuples and out-of-domain values. It is an important step aimed at correcting and adjusting the data for the subsequent stages. Within this perspective, techniques that seek to address these problems are presented. This work characterizes the detection of duplicate tuples in databases, presents the main algorithms based on distance metrics and some tools designed for this activity, and develops a language-independent algorithm for identifying duplicate records based on phonetic and numeric similarity, implemented with multithreading to improve its runtime. Tests show that the proposed algorithm achieved better results in identifying duplicate records than existing phonetic algorithms, which ensures a better cleaning of the database.
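The abstract above combines similarity-based duplicate detection with multithreading to reduce runtime. Below is a minimal sketch of that pattern with a stand-in similarity function and a thread pool; the thesis's phonetic and numeric similarity algorithm is not reproduced, and in CPython a process pool is usually needed for real CPU-bound gains because of the GIL.

```python
# Illustrative sketch only: pairwise duplicate detection where candidate pairs
# are compared in parallel workers. Swap ThreadPoolExecutor for
# ProcessPoolExecutor when the similarity function is CPU-bound in CPython.
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] (difflib ratio as a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.85, workers=4):
    pairs = list(combinations(records, 2))
    def check(pair):
        a, b = pair
        return pair if similarity(a, b) >= threshold else None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [p for p in pool.map(check, pairs) if p]

names = ["Joao da Silva", "João da Silva", "Maria Souza", "Pedro Almeida"]
print(find_duplicates(names))   # flags the accent-only variant as a duplicate
```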
Vavruška, Marek. "Realised stochastic volatility in practice." Master's thesis, Vysoká škola ekonomická v Praze, 2012. http://www.nusl.cz/ntk/nusl-165381.
Zaidi, Houda. "Amélioration de la qualité des données : correction sémantique des anomalies inter-colonnes." Thesis, Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1094/document.
Data quality represents a major challenge because the cost of anomalies can be very high, especially for large databases in enterprises that need to exchange information between systems and integrate large amounts of data. Decision making based on erroneous data has a bad influence on the activities of organizations. The quantity of data continues to increase, as do the risks of anomalies. The automatic correction of these anomalies is a topic that is becoming increasingly important both in business and in the academic world. In this work, we propose an approach to better understand the semantics and the structure of the data. Our approach helps to automatically correct intra-column anomalies as well as inter-column ones. We aim to improve the quality of data by processing null values and the semantic dependencies between columns.
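The abstract above targets inter-column anomalies, where the value in one column is semantically constrained by another. The fragment below is only a minimal illustration of such a correction, assuming a simple majority-vote repair rather than the semantic approach developed in the thesis.

```python
# Minimal illustration of an inter-column correction: when column A determines
# column B, infer the dominant B value for each A value and repair rows that
# disagree with it.
from collections import Counter, defaultdict

rows = [
    {"country": "France",  "capital": "Paris"},
    {"country": "France",  "capital": "Paris"},
    {"country": "France",  "capital": "Lyon"},      # inter-column anomaly
    {"country": "Tunisia", "capital": "Tunis"},
]

def repair(rows, det="country", dep="capital"):
    votes = defaultdict(Counter)
    for r in rows:
        votes[r[det]][r[dep]] += 1
    consensus = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    fixed = []
    for r in rows:
        if r[dep] != consensus[r[det]]:
            r = {**r, dep: consensus[r[det]]}        # replace the anomalous value
        fixed.append(r)
    return fixed

for r in repair(rows):
    print(r)
```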
Cugler, Daniel Cintra 1982. "Supporting the collection and curation of biological observation metadata = Apoio à coleta e curadoria de metadados de observações biológicas." [s.n.], 2014. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275520.
Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação
Abstract: Biological observation databases contain information about the occurrence of an organism or set of organisms detected at a given place and time according to some methodology. Such databases store a variety of data, at multiple spatial and temporal scales, including images, maps, sounds, texts and so on. This priceless information can be used in a wide range of research initiatives, e.g., global warming, species behavior or food production. All such studies are based on analyzing the records themselves and their metadata. Most times, analyses start from the metadata, which are often used to index the observation records. However, given the nature of observation activities, metadata may suffer from quality problems, hampering such analyses. For example, there may be metadata gaps (e.g., missing attributes, or insufficient records). This can have serious effects: in biodiversity studies, for instance, metadata problems regarding a single species can affect the understanding not just of the species, but of wider ecological interactions. This thesis proposes a set of processes to help solve problems in metadata quality. While previous approaches concern one given aspect of the problem, the thesis provides an architecture and algorithms that encompass the whole cycle of managing biological observation metadata, which goes from acquiring data to retrieving database records. Our contributions are divided into two categories: (a) data enrichment and (b) data cleaning. Contributions in category (a) provide additional information both for missing attributes in existing records and for missing records required by specific studies. Our strategies use authoritative remote data sources and VGI (Volunteered Geographic Information) to enrich such metadata, providing the missing information. Contributions in category (b) detect anomalies in biological observation metadata by performing spatial analyses that contrast the location of the observations with authoritative geographic distribution maps. Thus, the main contributions are: (i) an architecture to retrieve biological observation records, which derives missing attributes by using external data sources; (ii) a geographical approach for anomaly detection; and (iii) an approach for adaptive acquisition of VGI to fill out metadata gaps, using mobile devices and sensors. These contributions were validated by actual implementations, using as a case study the challenges presented by the management of biological observation metadata of the Fonoteca Neotropical Jacques Vielliard (FNJV), one of the top 10 animal sound collections in the world.
Doctorate
Computer Science
Doctor of Computer Science
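Category (b) in the abstract above contrasts observation coordinates with authoritative species distribution maps. The sketch below illustrates only the shape of that spatial check, with an invented bounding box standing in for a real distribution map.

```python
# Hedged sketch of spatial anomaly detection for observation metadata: flag
# observations whose coordinates fall outside the species' known range. The
# range below is a hypothetical bounding box, not an authoritative map.
from dataclasses import dataclass

@dataclass
class BBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)

# Hypothetical distribution range for one species.
ranges = {"Hypsiboas faber": BBox(-34.0, -5.0, -58.0, -34.0)}

observations = [
    {"species": "Hypsiboas faber", "lat": -22.8, "lon": -47.1},  # plausible
    {"species": "Hypsiboas faber", "lat": 48.9,  "lon": 2.3},    # likely anomaly
]

for obs in observations:
    box = ranges.get(obs["species"])
    ok = box.contains(obs["lat"], obs["lon"]) if box else None
    print(obs, "->", "ok" if ok else "flag for curation")
```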
Olejník, Tomáš. "Zpracování obchodních dat finančního trhu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-412828.
Norng, Sorn. "Statistical decisions in optimising grain yield." Queensland University of Technology, 2004. http://eprints.qut.edu.au/15806/.
Andrade, Tiago Luís de. "Ambiente independente de idioma para suporte a identificação de tuplas duplicadas por meio da similaridade fonética e numérica: otimização de algoritmo baseado em multithreading /." São José do Rio Preto : [s.n.], 2011. http://hdl.handle.net/11449/98678.
Abstract: In order to ensure greater reliability and consistency of the data stored in a database, the data cleaning stage is placed at the beginning of the Knowledge Discovery in Databases (KDD) process. This step is highly relevant because it eliminates problems that strongly affect the reliability of the extracted knowledge, such as missing values, null values, duplicate tuples and out-of-domain values. It is an important step aimed at correcting and adjusting the data for the subsequent stages. Within this perspective, techniques that seek to address these problems are presented. This work characterizes the detection of duplicate tuples in databases, presents the main algorithms based on distance metrics and some tools designed for this activity, and develops a language-independent algorithm for identifying duplicate records based on phonetic and numeric similarity, implemented with multithreading to improve its runtime. Tests show that the proposed algorithm achieved better results in identifying duplicate records than existing phonetic algorithms, which ensures a better cleaning of the database.
Advisor: Carlos Roberto Valêncio
Co-advisor: Maurizio Babini
Committee: Pedro Luiz Pizzigatti Corrêa
Committee: José Márcio Machado
Master's degree
Abraham, Lukáš. "Analýza dat síťové komunikace mobilních zařízení." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2020. http://www.nusl.cz/ntk/nusl-432938.
Grillo, Aderibigbe. "Developing a data quality scorecard that measures data quality in a data warehouse." Thesis, Brunel University, 2018. http://bura.brunel.ac.uk/handle/2438/17137.
Rapur, Niharika. "Treatment of Data with Missing Elements in Process Modelling." University of Cincinnati / OhioLINK, 2003. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1060192778.
Heise, Arvid [Verfasser], and Felix [Akademischer Betreuer] Naumann. "Data cleansing and integration operators for a parallel data analytics platform / Arvid Heise ; Betreuer: Felix Naumann." Potsdam : Universität Potsdam, 2015. http://d-nb.info/1217717633/34.
Full textТодоріко, Ольга Олексіївна. "Моделі та методи очищення та інтеграції текстових даних в інформаційних системах." Thesis, Запорізький національний університет, 2016. http://repository.kpi.kharkov.ua/handle/KhPI-Press/21856.
The thesis for the candidate degree in technical sciences, speciality 05.13.06 – Information Technologies. – National Technical University "Kharkiv Polytechnic Institute", Kharkiv, 2016. The thesis solves the actual scientific and practical problem of increasing the efficiency and quality of data cleaning and integration in information reference and information retrieval systems. The information technology for cleaning and integrating data is improved by reducing the number of errors in text information through the use of an inflectional paradigm model, methods for building a lexeme index, and advanced tolerant retrieval methods. The developed inflectional paradigm model includes a representation of words as an ordered collection of signatures and an approximate measure of similarity between two representations; the model differs in how it handles word forms and character positions. It provides the basis for implementing improved methods of tolerant retrieval, cleaning and integration of datasets. A method for building the lexeme index, based on the proposed inflectional paradigm model, is developed; it maps a word and all of its forms to a single index record. The tolerant retrieval method is improved at the preliminary filtering stage thanks to the developed model and the lexeme index. The experimental evaluation indicates high precision and 99 0,5 % recall. The information technology of cleaning and integration of data is improved using the developed models and methods, and software that performs tolerant retrieval, cleaning and integration of data sets on their basis was implemented. The theoretical and practical results of the thesis have been put into use in the document flow of the admissions committee and in the educational process of the mathematical faculty of the State institution of higher education "Zaporizhzhya National University".
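The abstract above describes representing words as ordered collections of signatures and matching them with an approximate similarity measure. The following sketch shows the general signature-index pattern under simplified assumptions (character bigrams and Jaccard similarity); it is not the thesis's inflectional paradigm model or lexeme index.

```python
# Sketch of signature-based tolerant retrieval: words are reduced to character
# bigram signatures, indexed by bigram, and matched with an approximate
# similarity measure against a possibly misspelled query.
from collections import defaultdict

def signature(word: str) -> frozenset:
    w = f"#{word.lower()}#"                       # boundary markers
    return frozenset(w[i:i + 2] for i in range(len(w) - 1))

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

class TolerantIndex:
    def __init__(self):
        self.postings = defaultdict(set)          # bigram -> words containing it
        self.signatures = {}

    def add(self, word: str):
        sig = signature(word)
        self.signatures[word] = sig
        for gram in sig:
            self.postings[gram].add(word)

    def search(self, query: str, threshold: float = 0.5):
        sig = signature(query)
        candidates = set().union(*(self.postings[g] for g in sig if g in self.postings))
        return sorted((w for w in candidates
                       if jaccard(sig, self.signatures[w]) >= threshold),
                      key=lambda w: -jaccard(sig, self.signatures[w]))

idx = TolerantIndex()
for w in ["cleaning", "cleaned", "clean", "integration"]:
    idx.add(w)
print(idx.search("cleanning"))                    # tolerant to the misspelling
```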
Тодоріко, Ольга Олексіївна. "Моделі та методи очищення та інтеграції текстових даних в інформаційних системах." Thesis, НТУ "ХПІ", 2016. http://repository.kpi.kharkov.ua/handle/KhPI-Press/21853.
The thesis for the candidate degree in technical sciences, speciality 05.13.06 – Information Technologies. – National Technical University «Kharkiv Polytechnic Institute», Kharkiv, 2016. The thesis solves the actual scientific and practical problem of increasing the efficiency and quality of data cleaning and integration in information reference and information retrieval systems. The information technology for cleaning and integrating data is improved by reducing the number of errors in text information through the use of an inflectional paradigm model, methods for building a lexeme index, and advanced tolerant retrieval methods. The developed inflectional paradigm model includes a representation of words as an ordered collection of signatures and an approximate measure of similarity between two representations; the model differs in how it handles word forms and character positions. It provides the basis for implementing improved methods of tolerant retrieval, cleaning and integration of datasets. A method for building the lexeme index, based on the proposed inflectional paradigm model, is developed; it maps a word and all of its forms to a single index record. The tolerant retrieval method is improved at the preliminary filtering stage thanks to the developed model and the lexeme index. The experimental evaluation indicates high precision and 99 0,5 % recall. The information technology of cleaning and integration of data is improved using the developed models and methods, and software that performs tolerant retrieval, cleaning and integration of data sets on their basis was implemented. The theoretical and practical results of the thesis have been put into use in the document flow of the admissions committee and in the educational process of the mathematical faculty of the State institution of higher education «Zaporizhzhya National University».
Smolík, Ondřej. "Datová kvalita, integrita a konsolidace dat v BI." Master's thesis, Vysoká škola ekonomická v Praze, 2008. http://www.nusl.cz/ntk/nusl-12350.
Bartoš, Jan. "Master Data Integration hub - řešení pro konsolidaci referenčních dat v podniku." Master's thesis, Vysoká škola ekonomická v Praze, 2011. http://www.nusl.cz/ntk/nusl-73508.
Rut, Lukáš. "Kvalita dat a efektivní využití rejstříků státní správy." Master's thesis, Vysoká škola ekonomická v Praze, 2009. http://www.nusl.cz/ntk/nusl-11562.
Maršová, Eliška. "Predikce hodnot v čase." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255333.
Henriksson, Erik, and Kristopher Werlinder. "Housing Price Prediction over Countrywide Data : A comparison of XGBoost and Random Forest regressor models." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-302535.
The aim of this study is to compare and investigate how an XGBoost regressor and a Random Forest regressor perform in predicting house prices. This is done using two datasets. The comparison considers the models' training time, inference time and the three evaluation metrics R2, RMSE and MAPE. The datasets are described in detail together with background on the regression models. The method includes cleaning the datasets, searching for optimal hyperparameters for the models, and 5-fold cross-validation to achieve good predictions. The result of the study is that the XGBoost regressor performs better on both small and large datasets, but that it is clearly superior when it comes to large datasets. While the Random Forest model can achieve results similar to the XGBoost model, its training takes about 250 times as long and its inference time is about 40 times longer. This makes XGBoost particularly advantageous when working with large datasets.
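The comparison described above (5-fold cross-validation, R2/RMSE, training and inference time) can be reproduced in outline as follows. This is only a sketch: it assumes scikit-learn and xgboost are installed, uses synthetic data in place of the housing datasets, and omits the hyperparameter search.

```python
# Sketch of an XGBoost vs. Random Forest comparison with 5-fold cross-validation.
# Synthetic data stands in for the thesis's housing datasets.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from xgboost import XGBRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "xgboost": XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=0),
}

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=("r2", "neg_root_mean_squared_error"))
    print(name,
          "R2=%.3f" % scores["test_r2"].mean(),
          "RMSE=%.1f" % -scores["test_neg_root_mean_squared_error"].mean(),
          "fit_time=%.1fs" % scores["fit_time"].mean())
```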