Dissertations / Theses: 'Data'

1

Riminucci, Stefania. "COVID-19, Open data e data visualization: interazione con dati epidemiologici." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/21577/.

Abstract:

The aim of this thesis is to analyse the effectiveness of different data visualization strategies used to present open data in graphical form to the general public. The case study is the COVID-19 pandemic and the many visualizations that draw on open data made available by the scientific community to show how infections evolved at national and international level. To assess the effectiveness of these visualizations, a questionnaire was developed and offered to the public, collecting data on how well various types of charts and dashboards offered by different online platforms are understood and used. The questionnaire included both charts for respondents to evaluate and tasks that required them to look up information on external platforms. 99 users responded. Analysis of the collected data showed no clear predominance of any of the proposed charts, giving only some indication of a preference for histograms and cartograms over other chart types. Likewise, the analysis of how easily information could be found on the various external platforms produced no significant results, underlining the impact of subjective factors and of each person's background.

2

Mondaini, Luca. "Data Visualization di dati spazio-temporali." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/16853/.

Abstract:

After introducing the existing types of data and the concepts of Big Data, Open Data and Data Visualization, this thesis pursues two distinct objectives: obtaining information through the Google Maps APIs, specifically the latitude and longitude of the home address of every student enrolled at UniBo, and visualizing these data in a web application through interactive histograms and digital maps, using Data Visualization techniques.

3

Yu, Wenyuan. "Improving data quality : data consistency, deduplication, currency and accuracy." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/8899.

Abstract:

Data quality is one of the key problems in data management. An unprecedented amount of data has been accumulated and has become a valuable asset of an organization. The value of the data relies greatly on its quality. However, data is often dirty in real life. It may be inconsistent, duplicated, stale, inaccurate or incomplete, which can reduce its usability and increase the cost of businesses. Consequently, the need to improve data quality arises, which involves five central issues, namely data consistency, data deduplication, data currency, data accuracy and information completeness. This thesis presents the results of our work on the first four issues: data consistency, deduplication, currency and accuracy. The first part of the thesis investigates incremental verification of data consistency in distributed data. Given a distributed database D, a set S of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, the changes ΔV to V in response to ΔD. Although the problems are intractable, we show that they are bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. Such incremental algorithms are provided for both vertically and horizontally partitioned data, and we show that the algorithms are optimal. The second part of the thesis studies the interaction between record matching and data repairing. Record matching, the main technique underlying data deduplication, aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data using constraints. These are treated as separate processes in most data cleaning systems, based on heuristic solutions.
However, our studies show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, a uniform framework that seamlessly unifies repairing and matching operations is proposed to clean a database based on integrity constraints, matching rules and master data. The third part of the thesis presents our study of finding certain fixes that are absolutely correct for data repairing. Data repairing methods based on integrity constraints are normally heuristic, and they may not find certain fixes. Worse still, they may even introduce new errors when attempting to repair the data, which may not work well when repairing critical data such as medical records, in which a seemingly minor error often has disastrous consequences. We propose a framework and an algorithm to find certain fixes, based on master data, a class of editing rules and user interactions. A prototype system is also developed. The fourth part of the thesis introduces inferring data currency and consistency for conflict resolution, where data currency aims to identify the current values of entities, and conflict resolution is to combine tuples that pertain to the same real-world entity into a single tuple and resolve conflicts, which is also an important issue for data deduplication. We show that data currency and consistency help each other in resolving conflicts. We study a number of associated fundamental problems, and develop an approach for conflict resolution by inferring data currency and consistency. The last part of the thesis reports our study of data accuracy on the longstanding relative accuracy problem which is to determine, given tuples t1 and t2 that refer to the same entity e, whether t1[A] is more accurate than t2[A], i.e., t1[A] is closer to the true value of the A attribute of e than t2[A]. 
We introduce a class of accuracy rules and an inference system with a chase procedure to deduce relative accuracy, and the related fundamental problems are studied. We also propose a framework and algorithms for inferring accurate values with users’ interaction.
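
To make the notion of incrementally maintaining CFD violations more concrete, here is a minimal single-site sketch. It is not the thesis's algorithm: it ignores data partitioning and data shipment entirely, and the class name `CFDChecker`, the attribute names and the example rule are all hypothetical. It only illustrates the idea that the work done per update batch can depend on the size of the batch rather than on the size of the whole database.

```python
from collections import defaultdict


class CFDChecker:
    """Tracks violations of one CFD: among rows matching `cond`, lhs -> rhs.

    Hypothetical illustration only; a single in-memory site is assumed.
    """

    def __init__(self, cond, lhs, rhs):
        self.cond = cond  # pattern: attribute -> required constant
        self.lhs = lhs    # tuple of left-hand-side attribute names
        self.rhs = rhs    # single right-hand-side attribute name
        # lhs key -> {rhs value -> number of rows carrying that value}
        self.groups = defaultdict(lambda: defaultdict(int))

    def _matches(self, row):
        return all(row.get(a) == v for a, v in self.cond.items())

    def is_violated(self, key):
        # A group violates the CFD when it holds more than one rhs value.
        return len(self.groups.get(key, {})) > 1

    def apply(self, inserts=(), deletes=()):
        """Apply an update batch and return {lhs key: new status} for every
        group whose violation status changed. Only touched groups are examined."""
        before = {}
        batch = [(r, 1) for r in inserts] + [(r, -1) for r in deletes]
        for row, delta in batch:
            if not self._matches(row):
                continue
            key = tuple(row[a] for a in self.lhs)
            before.setdefault(key, self.is_violated(key))
            counts = self.groups[key]
            counts[row[self.rhs]] += delta
            if counts[row[self.rhs]] <= 0:
                del counts[row[self.rhs]]
        return {k: self.is_violated(k) for k, was in before.items()
                if self.is_violated(k) != was}
```

For example, with the hypothetical rule "for rows with country = UK, zip determines street", inserting two UK rows that share a zip but differ in street reports that group as newly violated, and deleting one of them reports it as resolved.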

4

Long, Christopher C. "Data Processing for NASA's TDRSS DAMA Channel." International Foundation for Telemetering, 1996. http://hdl.handle.net/10150/611474.

Abstract:

International Telemetering Conference Proceedings / October 28-31, 1996 / Town and Country Hotel and Convention Center, San Diego, California
Presently, NASA's Space Network (SN) does not have the ability to receive random messages from satellites using the system. Scheduling of the service must be done by the owner of the spacecraft through Goddard Space Flight Center (GSFC). The goal of NASA is to improve the current system so that random messages generated on board the satellite can be received by the SN. The messages will be requests for service that the satellite's control system deems necessary. These messages will then be sent to the owner of the spacecraft, where appropriate action and scheduling can take place. This new service is known as the Demand Assignment Multiple Access system (DAMA).

5

Budd, Chris. "Data Protection and Data Elimination." International Foundation for Telemetering, 2015. http://hdl.handle.net/10150/596395.

Abstract:

ITC/USA 2015 Conference Proceedings / The Fifty-First Annual International Telemetering Conference and Technical Exhibition / October 26-29, 2015 / Bally's Hotel & Convention Center, Las Vegas, NV
Data security is becoming increasingly important in all areas of storage. The news services frequently have stories about lost or stolen storage devices and the panic they cause. Data security in an SSD usually involves two components: data protection and data elimination. Data protection includes passwords to protect against unauthorized access and encryption to protect against recovering data from the flash chips. Data elimination includes erasing the encryption key and erasing the flash. Telemetry applications frequently add requirements such as write protection, external erase triggers, and overwriting the flash after the erase. This presentation will review these data security features.

6

Furrier, Sean Alexander. "Communicating Data: Data-Driven Storytelling." Thesis, The University of Arizona, 2017. http://hdl.handle.net/10150/624989.

Abstract:

Data is more abundant than ever, yet its utility is diminished by a lack of understanding and difficulty in communicating insights. This thesis seeks to test the effectiveness of data-driven storytelling as a means to solve this disconnect. Research conducted includes reading previous literature on the subject, interviewing journalists and data practitioners as well as learning to use various software tools. This research focuses on communicating engaging stories by finding, cleaning, analyzing and visualizing data using R, Python, Excel, Tableau, Carto and other software tools. The result is a series of data-driven stories published in the Daily Wildcat on a variety of subjects including campus life, politics, and sports. The conclusion of the thesis finds that data-driven storytelling is an effective medium for communicating data and capitalizing on its potential utility. This conclusion is drawn from the fact that humans intuitively understand narrative and data insights parsed out in this familiar form are more easily understood than data presented in an abstract manner.

7

Chitondo, Pepukayi David Junior. "Data policies for big health data and personal health data." Thesis, Cape Peninsula University of Technology, 2016. http://hdl.handle.net/20.500.11838/2479.

Abstract:

Thesis (MTech (Information Technology))--Cape Peninsula University of Technology, 2016.
Health information policies are increasingly becoming a key feature in directing information usage in healthcare. After the passing of the Health Information Technology for Economic and Clinical Health (HITECH) Act in 2009 and the Affordable Care Act (ACA) in 2010, in the United States, there has been an increase in health systems innovations. Coupled with this health systems hype is the current buzz concept in Information Technology, 'Big data'. The prospects of big data are full of potential, even more so in the healthcare field where the accuracy of data is life critical. How big health data can be used to achieve improved health is now the goal of the current health informatics practitioner. Even more exciting is the amount of health data being generated by patients via personal handheld devices and other forms of technology that exclude the healthcare practitioner. This patient-generated data is also known as Personal Health Records (PHRs). To achieve meaningful use of PHRs and healthcare data in general through big data, a couple of hurdles have to be overcome. First and foremost is the issue of privacy and confidentiality of the patients whose data is concerned. Second is the perceived trustworthiness of PHRs among healthcare practitioners. Other issues to take into account are data rights and ownership, data suppression, IP protection, data anonymisation and reidentification, information flow and regulations, as well as consent biases. This study sought to understand the role of data policies in the process of data utilisation in the healthcare sector, with added interest in the utilisation of PHRs as part of big health data.

8

Braschi, Giacomo. "La circolazione dei dati e l'analisi big data." Doctoral thesis, Università degli studi di Pavia, 2019. http://hdl.handle.net/11571/1244327.

Abstract:

Description of the legal instruments that regulate the circulation of data, and analysis of possible regulatory developments that would be desirable to facilitate it.

9

Perovich, Laura J. (Laura Jones). "Data Experiences : novel interfaces for data engagement using environmental health data." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/95612.

Abstract:

Thesis: S.M., Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2014.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 71-81).
For the past twenty years, the data visualization movement has reworked the way we engage with information. It has brought fresh excitement to researchers and reached broad audiences. But what comes next for data? I seek to create example "Data Experiences" that will contribute to developing new spaces of information engagement. Using data from Silent Spring Institute's environmental health studies as a test case, I explore Data Experiences that are immersive, interactive, and aesthetic. Environmental health datasets are ideal for this application as they are highly relevant to the general population and have appropriate complexity. Dressed in Data will focus on the experience of an individual with her/his own environmental health data while BigBarChart focuses on the experience of the community with the overall dataset. Both projects seek to present opportunities for nontraditional learning, community relevance, and social impact.

10

Wang, Yi. "Data Management and Data Processing Support on Array-Based Scientific Data." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1436157356.

11

Dedge, Parks Dana M. "Defining Data Science and Data Scientist." Thesis, University of South Florida, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10639701.

Abstract:

The world’s data sets are growing exponentially every day due to the large number of devices generating data residue across the multitude of global data centers. What to do with the massive data stores, how to manage them, and who performs these tasks have not been adequately defined and agreed upon by academics and practitioners. Data science is a cross-disciplinary amalgam of skills, techniques and tools which allow business organizations to identify trends and build assumptions which lead to key decisions. It is in an evolutionary state as new technologies with capabilities are still being developed and deployed. The data science tasks and the data scientist skills needed in order to be successful with the analytics across the data stores are defined in this document. The research conducted across twenty-two academic articles, one book, eleven interviews and seventy-eight surveys is combined to articulate the convergence on the term data science. In addition, the research identified that there are five key skill categories (themes) which have fifty-five competencies that are used globally by data scientists to successfully perform the art and science activities of data science.

Unspecified portions of statistics, technology programming, development of models and calculations are combined to determine outcomes which lead global organizations to make strategic decisions every day.

This research is intended to provide a constructive summary about the topics data science and data scientist in order to spark the dialogue for us to formally finalize the definitions and ultimately change the world by establishing set guidelines on how data science is performed and measured.

12

Proskurnia, Iuliia. "Genium Data Store : Distributed Data store." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-141552.

Abstract:

In recent years the need for distributed data storage has led to the design of new systems for large-scale environments. The growth of unbounded streams of data, and the need to store and analyze them in real time, reliably, scalably and quickly, are the reasons such systems have appeared in the financial sector, especially at the stock exchange Nasdaq OMX. Furthermore, an internally designed, totally ordered, reliable message bus is used in Nasdaq OMX for almost all internal subsystems. Extensive theoretical and practical studies of reliable totally ordered multicast have been made in academia, and it has been proven to serve as a fundamental building block in the construction of distributed fault-tolerant applications. In this work, we leverage the NOMX low-latency reliable totally ordered message bus, with a capacity of at least 2 million messages per second, to build a high-performance distributed data store. Consistency of data operations can be easily achieved by using the messaging bus, as it forwards all messages in reliable total order. Moreover, relying on the reliable totally ordered messaging, active in-memory replication support for fault tolerance and load balancing is integrated. A prototype was developed using production environment requirements to demonstrate its feasibility. Experimental results show great scalability and performance, serving around 400,000 insert operations per second over 6 data nodes with 100 microseconds latency. Latency for single-record read operations is bounded to under half a millisecond, while data ranges are retrieved with sub-100 Mbps capacity from one node. Moreover, performance improvements under a greater number of data store nodes are shown for both writes and reads. It is concluded that uniform totally ordered sequenced input data can be used in real time for large-scale distributed data storage to maintain strong consistency, fault tolerance and high performance.
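
The consistency argument in this abstract rests on a general property: if every replica applies the same operations in the same sequence order, deterministic replicas end up in the same state. The sketch below illustrates only that property. The `Sequencer` and `Replica` classes and the operation format are hypothetical stand-ins; they do not reflect the Genium Data Store design or the NOMX message bus API.

```python
class Sequencer:
    """Assigns a global, gap-free sequence number to each operation (hypothetical)."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, op):
        seq = self.next_seq
        self.next_seq += 1
        return seq, op


class Replica:
    """In-memory key-value replica that applies operations strictly in sequence order."""

    def __init__(self):
        self.store = {}
        self.expected = 0   # next sequence number this replica may apply
        self.pending = {}   # operations that arrived early, keyed by sequence number

    def deliver(self, seq, op):
        # Buffer the operation, then apply every operation that is now in order.
        self.pending[seq] = op
        while self.expected in self.pending:
            kind, key, value = self.pending.pop(self.expected)
            if kind == "put":
                self.store[key] = value
            elif kind == "delete":
                self.store.pop(key, None)
            self.expected += 1
```

Two replicas that receive the same stamped operations, even in different arrival orders, hold identical stores once all operations have been delivered, because each one buffers early arrivals and applies them by sequence number.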

13

Morshedzadeh, Iman. "Data Classification in Product Data Management." Thesis, Högskolan i Skövde, Institutionen för teknik och samhälle, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14651.

Abstract:

This report presents a product data classification methodology that is usable for the production data of the Volvo Cars Engine (VCE) factory and can be implemented in the Teamcenter software. A large amount of data is generated during the life cycle of each product, and companies try to manage these data with product data management software. Data classification is the part of data management that enables the most effective and efficient use of data. Surveys carried out in this project identified the items that affect data classification: data, attributes, the classification method, the Volvo Cars Engine factory, and Teamcenter as the product data management software. In this report, each of these items is explained separately. With the knowledge obtained about these items, a suitable hierarchical classification method for the Volvo Cars Engine factory is described. Finally, the method is implemented in the software in the last part of the report to show that it is executable.

14

Dedge, Parks Dana M. "Defining Data Science and Data Scientist." Scholar Commons, 2017. http://scholarcommons.usf.edu/etd/7014.

Abstract:

The world’s data sets are growing exponentially every day due to the large number of devices generating data residue across the multitude of global data centers. What to do with the massive data stores, how to manage them, and who performs these tasks have not been adequately defined and agreed upon by academics and practitioners. Data science is a cross-disciplinary amalgam of skills, techniques and tools which allow business organizations to identify trends and build assumptions which lead to key decisions. It is in an evolutionary state as new technologies with capabilities are still being developed and deployed. The data science tasks and the data scientist skills needed in order to be successful with the analytics across the data stores are defined in this document. The research conducted across twenty-two academic articles, one book, eleven interviews and seventy-eight surveys is combined to articulate the convergence on the term data science. In addition, the research identified that there are five key skill categories (themes) which have fifty-five competencies that are used globally by data scientists to successfully perform the art and science activities of data science. Unspecified portions of statistics, technology programming, development of models and calculations are combined to determine outcomes which lead global organizations to make strategic decisions every day. This research is intended to provide a constructive summary about the topics data science and data scientist in order to spark the dialogue for us to formally finalize the definitions and ultimately change the world by establishing set guidelines on how data science is performed and measured.

15

Strand, Mattias. "External Data Incorporation into Data Warehouses." Doctoral thesis, Kista : Skövde : Dept. of computer and system sciences, Stockholm University : School of humanities and informatics, University of Skövde, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-660.

16

Radhakrishnan, Radhika. "Genome data modeling and data compression." abstract and full text PDF (free order & download UNR users only), 2007. http://0-gateway.proquest.com.innopac.library.unr.edu/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:1447611.

17

Abedjan, Ziawasch. "Improving RDF data with data mining." PhD thesis, Universität Potsdam, 2014. http://opus.kobv.de/ubp/volltexte/2014/7133/.

Abstract:

Linked Open Data (LOD) comprises very many and often large public data sets and knowledge bases. Those datasets are mostly presented in the RDF triple structure of subject, predicate, and object, where each triple represents a statement or fact. Unfortunately, the heterogeneity of available open data requires significant integration steps before it can be used in applications. Meta information, such as ontological definitions and exact range definitions of predicates, is desirable and ideally provided by an ontology. However, in the context of LOD, ontologies are often incomplete or simply not available. Thus, it is useful to automatically generate meta information, such as ontological dependencies, range definitions, and topical classifications. Association rule mining, which was originally applied for sales analysis on transactional databases, is a promising and novel technique to explore such data. We designed an adaptation of this technique for mining RDF data and introduce the concept of “mining configurations”, which allows us to mine RDF data sets in various ways. Different configurations enable us to identify schema and value dependencies that in combination result in interesting use cases. To this end, we present rule-based approaches for auto-completion, data enrichment, ontology improvement, and query relaxation. Auto-completion remedies the problem of inconsistent ontology usage, providing an editing user with a sorted list of commonly used predicates. A combination of different configurations extends this approach to create completely new facts for a knowledge base. We present two approaches for fact generation, a user-based approach where a user selects the entity to be amended with new facts and a data-driven approach where an algorithm discovers entities that have to be amended with missing facts. As knowledge bases constantly grow and evolve, another approach to improve the usage of RDF data is to improve existing ontologies.
Here, we present an association rule based approach to reconcile ontology and data. Interlacing different mining configurations, we infer an algorithm to discover synonymously used predicates. Those predicates can be used to expand query results and to support users during query formulation. We provide a wide range of experiments on real world datasets for each use case. The experiments and evaluations show the added value of association rule mining for the integration and usability of RDF data and confirm the appropriateness of our mining configuration methodology.
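
As a rough illustration of how association rule mining can be pointed at RDF triples, the sketch below uses one possible configuration: each subject is treated as a transaction and its predicates as the items. This is an assumption made for illustration, not the thesis's implementation. The function names, thresholds and example predicates are hypothetical, and only pairwise rules of the form "p implies q" are mined.

```python
from collections import defaultdict
from itertools import permutations


def predicate_rules(triples, min_support=2, min_confidence=0.6):
    """Mine pairwise rules p -> q over predicates, grouping triples by subject.

    Returns {(p, q): confidence}, where confidence is
    (#subjects having both p and q) / (#subjects having p).
    """
    by_subject = defaultdict(set)
    for s, p, _o in triples:
        by_subject[s].add(p)

    single = defaultdict(int)  # predicate -> number of subjects using it
    pair = defaultdict(int)    # (p, q) -> number of subjects using both
    for preds in by_subject.values():
        for p in preds:
            single[p] += 1
        for p, q in permutations(sorted(preds), 2):
            pair[(p, q)] += 1

    rules = {}
    for (p, q), n in pair.items():
        confidence = n / single[p]
        if n >= min_support and confidence >= min_confidence:
            rules[(p, q)] = confidence
    return rules


def suggest(rules, present):
    """Rank predicates missing from `present` by the best rule that implies them."""
    scores = defaultdict(float)
    for (p, q), confidence in rules.items():
        if p in present and q not in present:
            scores[q] = max(scores[q], confidence)
    return sorted(scores, key=lambda q: (-scores[q], q))
```

Calling `suggest(rules, {"birthPlace"})` on rules mined from a small person dataset would return the predicates that most often co-occur with `birthPlace`, which is the flavour of sorted predicate list the auto-completion use case describes.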

18

Fahmy, A. "Data encryption of communication data links." Thesis, University of Kent, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.385199.

19

Sun, Wenjun. "Parallel data processing for semistructured data." Thesis, London South Bank University, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.434394.

20

Wiemann, Stefan. "Data Fusion in Spatial Data Infrastructures." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-216985.

Abstract:

Over the past decade, the public awareness and availability as well as methods for the creation and use of spatial data on the Web have steadily increased. Besides the establishment of governmental Spatial Data Infrastructures (SDIs), numerous volunteered and commercial initiatives had a major impact on that development. Nevertheless, data isolation still poses a major challenge. Whereas the majority of approaches focuses on data provision, means to dynamically link and combine spatial data from distributed, often heterogeneous data sources in an ad hoc manner are still very limited. However, such capabilities are essential to support and enhance information retrieval for comprehensive spatial decision making. To facilitate spatial data fusion in current SDIs, this thesis has two main objectives. First, it focuses on the conceptualization of a service-based fusion process to functionally extend current SDI and to allow for the combination of spatial data from different spatial data services. It mainly addresses the decomposition of the fusion process into well-defined and reusable functional building blocks and their implementation as services, which can be used to dynamically compose meaningful application-specific processing workflows. Moreover, geoprocessing patterns, i.e. service chains that are commonly used to solve certain fusion subtasks, are designed to simplify and automate workflow composition. Second, the thesis deals with the determination, description and exploitation of spatial data relations, which play a decisive role for spatial data fusion. The approach adopted is based on the Linked Data paradigm and therefore bridges SDI and Semantic Web developments. Whereas the original spatial data remains within SDI structures, relations between those sources can be used to infer spatial information by means of Semantic Web standards and software tools. A number of use cases were developed, implemented and evaluated to underpin the proposed concepts. 
Particular emphasis was put on the use of established open standards to realize an interoperable, transparent and extensible spatial data fusion process and to support the formalized description of spatial data relations. The developed software, which is based on a modular architecture, is available online as open source. It allows for the development and seamless integration of new functionality as well as the use of external data and processing services during workflow composition on the Web.
Over the last decade, the development of the Internet has substantially improved the availability and public awareness of spatial data, as well as the means of capturing and using it. This is due both to the establishment of governmental Spatial Data Infrastructures (SDIs) and to the growing number of community-based and commercial offerings. Because the focus is mostly on the provision of spatial data, however, there are hardly any means to link and combine the many datasets distributed across the Internet in an ad hoc manner, which in some cases leads to the isolation of spatial data holdings. Means for fusing them are nevertheless essential for extracting information that supports decisions on spatio-temporal questions. To enable ad hoc fusion of spatial data on the Internet, this thesis addresses two main topics. First, a service-based implementation of the fusion process is conceptualized in order to functionally extend existing SDIs. To this end, well-defined, reusable functional building blocks are described and provided via standardized service interfaces. This enables the dynamic composition of application-specific fusion processes over the Internet. Furthermore, geoprocessing patterns are defined to describe popular and frequently used service chains for handling particular subtasks of spatial data fusion and to simplify the composition and automation of fusion processes. As its second focus, the thesis addresses the question of how relations between spatial data holdings on the Internet can be created, described and used. The chosen approach is based on Linked Data principles and bridges service-oriented SDIs and the Semantic Web. While the spatial data thus remains in existing SDIs, Semantic Web tools and standards can be used to derive information from the identified spatial data relations.
To verify the developed concepts, a number of use cases were designed, realized with a prototypical implementation and subsequently evaluated. The emphasis was on an interoperable, transparent and extensible implementation of service-based fusion processes, and on a formalized description of data relations, using open and established standards. The software follows a modular structure and is freely available as open source. It allows both the development of new functionality by developers and the integration of existing data and processing services during the composition of a fusion process.

21

Virinchi, Billa. "Data Visualization of Telenor mobility data." Thesis, Blekinge Tekniska Högskola, Institutionen för kommunikationssystem, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-13951.

Abstract:

With the rapid development of cities, understanding the mobility patterns of subscribers is crucial for urban planning and for network infrastructure deployment. Because people carry mobile phones for communication during their daily activities, mobile phones can be used to analyze the mobility patterns of subscribers in the network. Effective utilization of network infrastructure (NI) requires studying these mobility patterns. The aim of the thesis is to simulate the geospatial Telenor mobility data (i.e. three categorized subscriber segments) and to provide visual support in Google Maps using the Google Maps API, which helps telecommunication operators make decisions about the effective utilization of NI. The thesis has two major objectives: first, to categorize the given geospatial Telenor mobility data using a subscriber mobility algorithm; second, to provide visual support for the categorized data in Google Maps using a geovisualization simulation tool. The subscriber mobility algorithm categorizes subscribers into three segments (i.e. infrastructure stressing, medium, friendly). A Tetris optimization model is used to validate and confirm the subscriber mobility algorithm. To give visual support for each categorized segment, a simulation tool was developed that displays the visualization results in Google Maps using the Google Maps API. The results are presented with respect to the objectives formulated above. Applying the subscriber mobility algorithm and the Tetris optimization model to a geospatial data set of 33,045 subscribers, only 1400 subscribers are found to be infrastructure stressing. To keep the visualization informative, a small region (i.e. the Borås region) is taken to visualize the subscribers from each of the categorized segments. The conclusion of the thesis is that the functionality thus developed contributes to knowledge discovery from geospatial data and provides visual support for decision making to telecommunication operators.

22

Gullipalli, Deep Kumar. "Data envelopment analysis with sparse data." Thesis, Kansas State University, 2011. http://hdl.handle.net/2097/13092.

Abstract:

Master of Science
Department of Industrial & Manufacturing Systems Engineering
David H. Ben-Arieh
The quest for continuous improvement among organizations and the issue of missing data in data analysis are never ending. This thesis brings these two topics under one roof, i.e., it evaluates the productivity of organizations with sparse data. This study focuses on Data Envelopment Analysis (DEA) to determine the efficiency of 41 member clinics of the Kansas Association of Medically Underserved (KAMU) with missing data. The primary focus of this thesis is to develop new reliable methods to determine the missing values and to execute DEA. DEA is a linear programming methodology to evaluate the relative technical efficiency of homogeneous Decision Making Units, using multiple inputs and outputs. The effectiveness of DEA depends on the quality and quantity of the data being used. DEA outcomes are susceptible to missing data, creating a need to supplement sparse data in a reliable manner. Determining missing values more precisely improves the robustness of the DEA methodology. Three methods to determine the missing values are proposed in this thesis, based on three different platforms. The first method, named the Average Ratio Method (ARM), uses the average value of all the ratios between two variables. The second method is based on a modified Fuzzy C-Means Clustering algorithm, which can handle missing data. The issues associated with this clustering algorithm are resolved to improve its effectiveness. The third method is based on an interval approach: missing values are replaced by interval ranges estimated by experts. Crisp efficiency scores are identified along similar lines to how DEA determines efficiency scores using the best set of weights. There is no unique way to evaluate the effectiveness of these methods. Their effectiveness is tested by choosing a complete dataset and assuming varying levels of data as missing. The best set of recovered missing values, based on the above methods, serves as the source to execute DEA. Results show that the DEA efficiency scores generated with recovered values are in close proximity to the actual efficiency scores that would be generated with the complete data. In summary, this thesis provides an effective and practical approach for replacing missing values needed for DEA.
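
The abstract describes the Average Ratio Method only as using the average of all ratios between two variables. A minimal sketch of one plausible reading follows; the function name `average_ratio_impute`, the choice of which variable is the reference, and the rule "missing value = mean ratio times the reference value" are assumptions for illustration, not the thesis's confirmed procedure.

```python
from statistics import mean


def average_ratio_impute(x, y):
    """Fill None entries of y from a fully observed reference variable x.

    Assumed reading of ARM: estimate the mean of y_i / x_i over the
    units where y is observed, then impute y_j = mean_ratio * x_j.
    """
    # Ratios over complete pairs only; skip x_i == 0 to avoid division by zero.
    ratios = [yi / xi for xi, yi in zip(x, y) if yi is not None and xi != 0]
    if not ratios:
        raise ValueError("no complete pairs to estimate the ratio from")
    r = mean(ratios)
    # Keep observed values untouched; replace only the missing ones.
    return [r * xi if yi is None else yi for xi, yi in zip(x, y)]


# Hypothetical clinic data: x = staff hours, y = patient visits (one missing).
filled = average_ratio_impute([10, 20, 40], [5, None, 20])
```

The imputed values could then be passed to a DEA solver in place of the gaps; how the thesis selects the variable pair is not stated in the abstract.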

23

Kostopoulos, A. "Combinatorial data analysis for data ordering." Thesis, University of Liverpool, 2016. http://livrepository.liverpool.ac.uk/3004631/.

Abstract:

Seriation is a combinatorial optimisation problem that aims to sequence a set of objects such that a natural ordering is created. A large variety of applications exist, ranging from archaeology to bioinformatics and text mining. Initially, a thorough and useful quantitative analysis compares different seriation algorithms using the positional proximity coefficient (PPC). This analysis helps the practitioner to understand how similar two algorithms are for a given set of datasets. The first contribution is consensus seriation. This method uses the principles of other consensus-based methods to combine different seriation solutions according to the PPC. As it creates a solution that no individual algorithm can create, its usefulness comes from combining different structural elements from each of the original algorithms. In particular, it is possible to create a solution that combines the local characteristics of one algorithm with the global characteristics of another. Experimental results show that, compared to consensus ranking based methods using the Hamming, Spearman and Kendall coefficients, consensus seriation using the PPC gives generally superior results according to the independent accumulated relative generalised anti-Robinson events measure. The second contribution is a metaheuristic for creating good approximate solutions to very large seriation problems. This adapted harmony search algorithm makes use of modified crossover operators taken from the genetic algorithm literature to optimise the least-squares criterion commonly used in seriation. As for all combinatorial optimisation problems, there is a need for metaheuristics that can produce better solutions more quickly. Results show that the algorithm consistently outperforms existing metaheuristic algorithms such as the genetic algorithm, particle swarm optimisation, simulated annealing and tabu search, as well as the genetic algorithm using the modified crossover operators, with the main advantage of creating a much superior result within a very short iteration frame. These two major contributions offer practitioners and academics new tools to tackle seriation-related problems, and a direction for future work is suggested.
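
To make the evaluation criterion more concrete, the sketch below counts plain anti-Robinson events for a given ordering of a dissimilarity matrix. It is the basic event count only, not the accumulated relative generalised variant named in the abstract, and the function name and toy data are illustrative assumptions.

```python
def anti_robinson_events(d, order):
    """Count anti-Robinson violations of `order` on dissimilarity matrix d.

    In anti-Robinson form, dissimilarities never decrease when moving
    away from the diagonal. For positions a < b < c this requires
    d[i][j] <= d[i][k] and d[j][k] <= d[i][k]; each failure is one event.
    """
    n = len(order)
    events = 0
    for a in range(n):
        for b in range(a + 1, n):
            for c in range(b + 1, n):
                i, j, k = order[a], order[b], order[c]
                if d[i][j] > d[i][k]:  # nearer neighbour is more dissimilar (row check)
                    events += 1
                if d[j][k] > d[i][k]:  # same check along the column
                    events += 1
    return events


# Toy data: four points on a line, dissimilarity = absolute distance.
points = [0, 1, 2, 3]
dist = [[abs(p - q) for q in points] for p in points]
perfect = anti_robinson_events(dist, [0, 1, 2, 3])
shuffled = anti_robinson_events(dist, [0, 2, 1, 3])
```

An ordering that follows the points along the line yields no events, while swapping two neighbours introduces violations, so a lower count indicates a better seriation under this criterion.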

24

Tata, Maitreyi. "Data analytics on Yelp data set." Kansas State University, 2017. http://hdl.handle.net/2097/38237.

Abstract:

Master of Science
Department of Computing and Information Sciences
William H. Hsu
In this report, I describe a query-driven system which helps in deciding which restaurant to invest in or which area is good to open a new restaurant in a specific place. Analysis is performed on already existing businesses in every state. This is based on certain factors such as the average star rating, the total number of reviews associated with a specific restaurant, the price range of the restaurant etc. The results will give an idea of successful restaurants in a city, which helps you decide where to invest and what are the things to be kept in mind while starting a new business. The main scope of the project is to concentrate on Analytics and Data Visualization.

25

Liu, Tantan. "Data Mining over Hidden Data Sources." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1343313341.

26

Yang, Ying. "Interactive Data Management and Data Analysis." Thesis, State University of New York at Buffalo, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10288109.

Abstract:

Everyone today has a big data problem. Data is everywhere and in different formats; it may be referred to as data lakes, data streams, or data swamps. To extract knowledge or insights from the data or to support decision-making, we need to go through a process of collecting, cleaning, managing and analyzing the data. In this process, data cleaning and data analysis are two of the most important and time-consuming components.

One common challenge in these two components is a lack of interaction. The data cleaning and data analysis are typically done as a batch process, operating on the whole dataset without any feedback. This leads to long, frustrating delays during which users have no idea if the process is effective. Lacking interaction, human expert effort is needed to make decisions on which algorithms or parameters to use in the systems for these two components.

We should teach computers to talk to humans, not the other way around. This dissertation focuses on building two systems, Mimir and CIA, that help users conduct data cleaning and analysis through interaction. Mimir is a system that allows users to clean big data in a cost- and time-efficient way through interaction, a process I call on-demand ETL. Convergent inference algorithms (CIA) are a family of inference algorithms for probabilistic graphical models (PGMs) that enjoy the benefits of both exact and approximate inference algorithms through interaction.

Mimir provides a general language for users to express different data cleaning needs. It acts as a shim layer that wraps around the database, making it possible for the bulk of the ETL process to remain within a classical deterministic system. Mimir also helps users to measure the quality of an analysis result and provides rankings for cleaning tasks to improve the result quality in a cost-efficient manner. CIA focuses on providing user interaction throughout the process of inference in PGMs. The goal of CIA is to free users from the upfront commitment to either approximate or exact inference, and to give users more control over time/accuracy trade-offs to direct decision-making and computation instance allocations. This dissertation describes the Mimir and CIA frameworks to demonstrate that it is feasible to build efficient interactive data management and data analysis systems.

27

Taylor, Phillip. "Data mining of vehicle telemetry data." Thesis, University of Warwick, 2015. http://wrap.warwick.ac.uk/77645/.

Abstract:

Driving is a safety-critical task that requires a high level of attention and workload from the driver. Despite this, people often perform secondary tasks such as eating or using a mobile phone, which increase workload levels and divert cognitive and physical attention from the primary task of driving. As well as these distractions, the driver may also be overloaded for other reasons, such as dealing with an incident on the road or holding conversations in the car. One solution to this distraction problem is to limit the functionality of in-car devices while the driver is overloaded. This can take the form of withholding an incoming phone call or delaying the display of a non-urgent piece of information about the vehicle. In order to design and build these adaptions in the car, we must first have an understanding of the driver's current level of workload. Traditionally, driver workload has been monitored using physiological sensors or camera systems in the vehicle. However, physiological systems are often intrusive and camera systems can be expensive and are unreliable in poor light conditions. It is important, therefore, to use methods that are non-intrusive, inexpensive and robust, such as sensors already installed on the car and accessible via the Controller Area Network (CAN)-bus. This thesis presents a data mining methodology for this problem, as well as for others in domains with similar types of data, such as human activity monitoring. It focuses on the variable selection stage of the data mining process, where inputs are chosen for models to learn from and make inferences. Selecting inputs from vehicle telemetry data is challenging because there are many irrelevant variables with a high level of redundancy. Furthermore, data in this domain often contains biases because only relatively small amounts can be collected and processed, leading to some variables appearing more relevant to the classification task than they really are.
Over the course of this thesis, a detailed variable selection framework that addresses these issues for telemetry data is developed. A novel blocked permutation method is developed and applied to mitigate biases when selecting variables from potentially biased temporal data. This approach is infeasible computationally when variable redundancies are also considered, and so a novel permutation redundancy measure with similar properties is proposed. Finally, a known redundancy structure between features in telemetry data is used to enhance the feature selection process in two ways. First the benefits of performing raw signal selection, feature extraction, and feature selection in different orders are investigated. Second, a two-stage variable selection framework is proposed and the two permutation based methods are combined. Throughout the thesis, it is shown through classification evaluations and inspection of the features that these permutation based selection methods are appropriate for use in selecting features from CAN-bus data.

28

Yao, Fang. "Functional data analysis for longitudinal data /." For electronic version search Digital dissertations database. Restricted to UC campuses. Access is free to UC campus dissertations, 2003. http://uclibs.org/PID/11984.

29

Ramljak, Dusan. "Data Driven High Performance Data Access." Diss., Temple University Libraries, 2018. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/530207.

Abstract:

Computer and Information Science
Ph.D.
Low-latency, high-throughput mechanisms to retrieve data become increasingly crucial as cyber and cyber-physical systems pour out increasing amounts of data that often must be analyzed in an online manner. Generally, as the data volume increases, the marginal utility of an "average" data item tends to decline, which requires greater effort in identifying the most valuable data items and making them available with minimal overhead. We believe that data analytics driven mechanisms have a big role to play in solving this needle-in-the-haystack problem. We rely on the claim that efficient pattern discovery and description, coupled with the observed predictability of complex patterns within many applications, offers significant potential to enable many I/O optimizations. Our research covers exploitation of the storage hierarchy for data driven caching and tiering, reduction of distance between data and computations, removing redundancy in data, using sparse representations of data, the impact of data access mechanisms on resilience, energy consumption and storage usage, and the enablement of new classes of data driven applications. For caching and prefetching, we offer a powerful model that separates the process of access prediction from the data retrieval mechanism. Predictions are made on a data entity basis and use the notions of "context" and its aspects such as "belief" to uncover and leverage future data needs. This approach allows truly opportunistic utilization of predictive information. We elaborate on which aspects of the context we use in different situations, in areas other than caching and prefetching, and why each is appropriate in the specified situation. We present in more detail the methods we have developed: BeliefCache for data driven caching and prefetching, and AVSC for pattern mining based compression of data.
Using a belief, an aspect of context representing an estimate of the probability that a storage element will be needed, we developed the modular framework BeliefCache to make unified, informed decisions about that element or a group. For the workloads we examined, we were able to capture complex non-sequential access patterns better than a state-of-the-art framework for optimizing cloud storage gateways. Moreover, our framework is also able to adjust to variations in the workload faster. It also does not require a static workload to be effective, since the modular framework allows for discovering and adapting to changes in the workload. In AVSC, using an aspect of context to gauge the similarity of the events, we perform our compression by keeping relevant events intact and approximating other events. We do that in two stages. We first generate a summarization of the data, then approximately match the remaining events with the existing patterns if possible, or add the patterns to the summary otherwise. We show gains over plain lossless compression for a specified amount of accuracy for purposes of identifying the state of the system, and a clear tradeoff between compressibility and fidelity. In the other research areas mentioned, we present challenges and opportunities in the hope that they will spur researchers to further examine those issues in the space of rapidly emerging data intensive applications. We also discuss ideas on how our research in other domains could be applied in our attempts to provide high performance data access.
Temple University--Theses

30

Sherikar, Vishnu Vardhan Reddy. "I2MAPREDUCE: DATA MINING FOR BIG DATA." CSUSB ScholarWorks, 2017. https://scholarworks.lib.csusb.edu/etd/437.

Abstract:

This project is an extension of i2MapReduce: Incremental MapReduce for Mining Evolving Big Data. i2MapReduce performs incremental big data processing using a fine-grained incremental engine and a general-purpose iterative model that supports iterative algorithms such as PageRank, Fuzzy C-Means (FCM), Generalized Iterated Matrix-Vector Multiplication (GIM-V) and Single Source Shortest Path (SSSP). The main purpose of this project is to reduce input/output overhead, to avoid incurring the cost of re-computation and to avoid stale data mining results. Finally, the performance of i2MapReduce is analyzed by comparing the resultant graphs.
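
For readers unfamiliar with the iterative algorithms listed, the following sketch shows a plain, single-machine PageRank iteration. It illustrates the kind of repeated computation that an incremental engine tries to avoid redoing from scratch; it is not i2MapReduce's implementation, and the function name, parameters and toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over an adjacency dict.

    `links` maps each node to the list of nodes it links to; every
    link target is assumed to also appear as a key.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleportation share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank


# Hypothetical three-page graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Each round recomputes every rank from the previous round, which is why a small change in the input graph normally forces a full re-run; the abstract's stated goal is to avoid that re-computation cost.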

31

Mathew, Avin D. "Asset management data warehouse data modelling." Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/19310/1/Avin_Mathew_Thesis.pdf.

Abstract:

Data are the lifeblood of an organisation, being employed by virtually all business functions within a firm. Data management, therefore, is a critical process in prolonging the life of a company and determining the success of each of an organisation’s business functions. The last decade and a half has seen data warehousing rising in priority within corporate data management as it provides an effective supporting platform for decision support tools. A cross-sectional survey conducted by this research showed that data warehousing is starting to be used within organisations for their engineering asset management, however the industry uptake is slow and has much room for development and improvement. This conclusion is also evidenced by the lack of systematic scholarly research within asset management data warehousing as compared to data warehousing for other business areas. This research is motivated by the lack of dedicated research into asset management data warehousing and attempts to provide original contributions to the area, focussing on data modelling. Integration is a fundamental characteristic of a data warehouse and facilitates the analysis of data from multiple sources. While several integration models exist for asset management, these only cover select areas of asset management. This research presents a novel conceptual data warehousing data model that integrates the numerous asset management data areas. The comprehensive ethnographic modelling methodology involved a diverse set of inputs (including data model patterns, standards, information system data models, and business process models) that described asset management data. Used as an integrated data source, the conceptual data model was verified by more than 20 experts in asset management and validated against four case studies. A large section of asset management data are stored in a relational format due to the maturity and pervasiveness of relational database management systems. 
Data warehousing offers the alternative approach of structuring data in a dimensional format, which suggests increased data retrieval speeds in addition to reducing analysis complexity for end users. To investigate the benefits of moving asset management data from a relational to multidimensional format, this research presents an innovative relational vs. multidimensional model evaluation procedure. To undertake an equitable comparison, the compared multidimensional models are derived from an asset management relational model and, as such, this research presents an original multidimensional modelling derivation methodology for asset management relational models. Multidimensional models were derived from the relational models in the asset management data exchange standard, MIMOSA OSA-EAI. The multidimensional and relational models were compared through a series of queries. It was discovered that multidimensional schemas reduced the data size and subsequently data insertion time, decreased the complexity of query conceptualisation, and improved the query execution performance across a range of query types. To facilitate the quicker uptake of these data warehouse multidimensional models within organisations, an alternate modelling methodology was investigated. This research presents an innovative approach of using a case-based reasoning methodology for data warehouse schema design. Using unique case representation and indexing techniques, the system also uses a business vocabulary repository to augment case searching and adaptation. The system was validated through a case study where multidimensional schema design speed and accuracy were measured. It was found that the case-based reasoning system provided a marginal benefit, with greater benefits gained when confronted with more difficult scenarios.

32

Mathew, Avin D. "Asset management data warehouse data modelling." Queensland University of Technology, 2008. http://eprints.qut.edu.au/19310/.

Abstract:

Data are the lifeblood of an organisation, being employed by virtually all business functions within a firm. Data management, therefore, is a critical process in prolonging the life of a company and determining the success of each of an organisation’s business functions. The last decade and a half has seen data warehousing rising in priority within corporate data management as it provides an effective supporting platform for decision support tools. A cross-sectional survey conducted by this research showed that data warehousing is starting to be used within organisations for their engineering asset management, however the industry uptake is slow and has much room for development and improvement. This conclusion is also evidenced by the lack of systematic scholarly research within asset management data warehousing as compared to data warehousing for other business areas. This research is motivated by the lack of dedicated research into asset management data warehousing and attempts to provide original contributions to the area, focussing on data modelling. Integration is a fundamental characteristic of a data warehouse and facilitates the analysis of data from multiple sources. While several integration models exist for asset management, these only cover select areas of asset management. This research presents a novel conceptual data warehousing data model that integrates the numerous asset management data areas. The comprehensive ethnographic modelling methodology involved a diverse set of inputs (including data model patterns, standards, information system data models, and business process models) that described asset management data. Used as an integrated data source, the conceptual data model was verified by more than 20 experts in asset management and validated against four case studies. A large section of asset management data are stored in a relational format due to the maturity and pervasiveness of relational database management systems. 
Data warehousing offers the alternative approach of structuring data in a dimensional format, which suggests increased data retrieval speeds in addition to reducing analysis complexity for end users. To investigate the benefits of moving asset management data from a relational to multidimensional format, this research presents an innovative relational vs. multidimensional model evaluation procedure. To undertake an equitable comparison, the compared multidimensional models are derived from an asset management relational model and, as such, this research presents an original multidimensional modelling derivation methodology for asset management relational models. Multidimensional models were derived from the relational models in the asset management data exchange standard, MIMOSA OSA-EAI. The multidimensional and relational models were compared through a series of queries. It was discovered that multidimensional schemas reduced the data size and subsequently data insertion time, decreased the complexity of query conceptualisation, and improved the query execution performance across a range of query types. To facilitate the quicker uptake of these data warehouse multidimensional models within organisations, an alternate modelling methodology was investigated. This research presents an innovative approach of using a case-based reasoning methodology for data warehouse schema design. Using unique case representation and indexing techniques, the system also uses a business vocabulary repository to augment case searching and adaptation. The system was validated through a case study where multidimensional schema design speed and accuracy were measured. It was found that the case-based reasoning system provided a marginal benefit, with greater benefits gained when confronted with more difficult scenarios.

33

Niggemann, Oliver. "Visual data mining of graph based data." [S.l. : s.n.], 2001. http://deposit.ddb.de/cgi-bin/dokserv?idn=962400505.

34

Peralta, Veronika. "Data Quality Evaluation in Data Integration Systems." Phd thesis, Université de Versailles-Saint Quentin en Yvelines, 2006. http://tel.archives-ouvertes.fr/tel-00325139.

Abstract:

The need for uniform access to multiple data sources grows stronger every day, particularly in decision-support systems, which require a comprehensive analysis of data. With the development of Data Integration Systems (DIS), information quality has become a first-class property increasingly demanded by users. This thesis deals with data quality in DIS. More precisely, we address the problems of evaluating the quality of the data delivered to users in response to their queries and of satisfying users' quality requirements. We also analyze the use of quality measures for improving DIS design and data quality. Our approach consists in studying one quality factor at a time, analyzing its relationship with the DIS, proposing techniques for its evaluation and proposing actions for its improvement. Among the quality factors that have been proposed, this thesis analyzes two: data freshness and data accuracy. We analyze the various definitions and measures that have been proposed for data freshness and accuracy, and we bring out the DIS properties that have a significant impact on their evaluation. We summarize the analysis of each factor by means of a taxonomy, which serves to compare existing work and to highlight open problems. We propose a framework that models the different elements involved in quality evaluation, such as data sources, user queries, DIS integration processes, DIS properties, quality measures and quality evaluation algorithms. In particular, we model the DIS integration processes as workflow processes, in which activities perform the tasks that extract, integrate and deliver data to users. Our reasoning support for quality evaluation is a directed acyclic graph, called the quality graph, which has the same structure as the DIS and is labelled with the DIS properties that are relevant for quality evaluation. We develop evaluation algorithms that take as input the quality values of the source data and the DIS properties, and combine these values to qualify the data delivered by the DIS. They rely on the graph representation and combine property values while traversing the graph. The evaluation algorithms can be specialized to take into account the properties that influence quality in a concrete application. The idea behind the framework is to define a flexible context that allows the evaluation algorithms to be specialized for specific application scenarios. The quality values obtained during evaluation are compared with those expected by users. Improvement actions can be carried out if the quality requirements are not satisfied. We suggest elementary improvement actions that can be composed to improve quality in a concrete DIS. Our approach for improving data freshness consists in analyzing the DIS at different levels of abstraction, so as to identify its critical points and target the application of improvement actions at those points. Our approach for improving data accuracy consists in partitioning query results into portions (certain attributes, certain tuples) with homogeneous accuracy. This allows user applications to display only the most accurate data, to filter out data that do not satisfy the accuracy requirements, or to display data in ranges according to their accuracy. Compared with existing source-selection approaches, our proposal makes it possible to select the most accurate portions instead of filtering out entire sources. The main contributions of this thesis are: (1) a detailed analysis of the freshness and accuracy quality factors; (2) the proposal of techniques and algorithms for evaluating and improving data freshness and accuracy; and (3) a quality evaluation prototype usable in DIS design.
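
To make the quality-graph idea (graphe de qualité) in the abstract above more concrete, here is a minimal sketch of propagating a freshness value through a directed acyclic graph. It is not the thesis's algorithm: it assumes that freshness is measured as data age in arbitrary time units, that each activity is labelled with a hypothetical processing delay, and that an activity's output is as old as its oldest input plus its own delay. The node names (`S1`, `extract1`, `integrate`, ...) and all numbers are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def propagate_freshness(predecessors, source_age, delay):
    """Return the data age at every node of an acyclic quality graph.

    predecessors: dict node -> list of nodes feeding it (empty for sources)
    source_age:   dict source node -> age of the data it delivers
    delay:        dict activity node -> assumed processing delay
    """
    age = {}
    # static_order() yields each node only after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        inputs = predecessors.get(node, [])
        if not inputs:
            # Source node: its age is given, not computed.
            age[node] = source_age[node]
        else:
            # Assumed combination rule: oldest input plus own delay.
            age[node] = max(age[p] for p in inputs) + delay.get(node, 0)
    return age


# Hypothetical DIS: two sources, two extraction activities, one
# integration activity and one delivery activity.
graph = {
    "S1": [],
    "S2": [],
    "extract1": ["S1"],
    "extract2": ["S2"],
    "integrate": ["extract1", "extract2"],
    "deliver": ["integrate"],
}
ages = propagate_freshness(
    graph,
    source_age={"S1": 2, "S2": 10},
    delay={"extract1": 1, "extract2": 1, "integrate": 3, "deliver": 0},
)
```

Comparing `ages["deliver"]` with the age a user is willing to accept corresponds to the comparison step the abstract describes; the path through `S2` dominates the result here, which is the kind of critical point an improvement action would target.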

35

Sánchez, Adam. "Big Data, Linked Data y Web semántica." Universidad Peruana de Ciencias Aplicadas (UPC), 2016. http://hdl.handle.net/10757/620705.

Abstract:

Lecture given as part of Open Access Week Peru (Semana del Acceso Abierto Perú), held from 24 to 26 October 2016 in Lima, Peru. Organizing institutions: Universidad Peruana de Ciencias Aplicadas (UPC), Pontificia Universidad Católica del Perú (PUCP) and Universidad Peruana Cayetano Heredia (UPCH).
The lecture addresses aspects of the Linked Data protocol, Big Data topics and the Semantic Web.

36

Nyström, Simon, and Joakim Lönnegren. "Processing data sources with big data frameworks." Thesis, KTH, Data- och elektroteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-188204.

Abstract:

Big data is a concept that is expanding rapidly. As more and more data is generated and garnered, there is an increasing need for efficient solutions that can be utilized to process all this data in attempts to gain value from it. The purpose of this thesis is to find an efficient way to quickly process a large number of relatively small files. More specifically, the purpose is to test two frameworks that can be used for processing big data. The frameworks that are tested against each other are Apache NiFi and Apache Storm. A method is devised in order to, firstly, construct a data flow and, secondly, construct a method for testing the performance and scalability of the frameworks running this data flow. The results reveal that Apache Storm is faster than Apache NiFi at the sort of task that was tested. As the number of nodes included in the tests went up, the performance did not always do the same. This indicates that adding more nodes to a big data processing pipeline does not always result in a better performing setup and that, sometimes, other measures must be taken to improve performance.

37

Li, Liangchun. "Web-based data visualization for data mining." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp03/MQ35845.pdf.

38

Tran, Viet-Trung. "Scalable data-management systems for Big Data." Phd thesis, École normale supérieure de Cachan - ENS Cachan, 2013. http://tel.archives-ouvertes.fr/tel-00920432.

Abstract:

Big Data can be characterized by 3 V's: Big Volume refers to the unprecedented growth in the amount of data; Big Velocity refers to the growth in the speed of moving data in and out of management systems; and Big Variety refers to the growth in the number of different data formats. Managing Big Data requires fundamental changes in the architecture of data management systems. Data storage must keep evolving to adapt to the growth of data, remaining scalable while maintaining high performance for data accesses. This thesis focuses on building scalable data management systems for Big Data. Our first and second contributions address the challenge of providing efficient support for Big Volume of data in data-intensive high performance computing (HPC) environments. In particular, we address the shortcomings of existing approaches in handling atomic, non-contiguous I/O operations in a scalable fashion. We propose and implement a versioning-based mechanism that can be leveraged to offer isolation for non-contiguous I/O without the need to perform expensive synchronizations. In the context of parallel array processing in HPC, we introduce Pyramid, a large-scale, array-oriented storage system. It revisits the physical organization of data in distributed storage systems for scalable performance. Pyramid favors multidimensional-aware data chunking that closely matches the access patterns generated by applications. Pyramid also favors distributed metadata management and versioning-based concurrency control to eliminate synchronization under concurrency. Our third contribution addresses Big Volume at the scale of geographically distributed environments. We consider BlobSeer, a distributed versioning-oriented data management service, and we propose BlobSeer-WAN, an extension of BlobSeer optimized for such geographically distributed environments. BlobSeer-WAN takes into account the latency hierarchy by favoring local metadata accesses. BlobSeer-WAN features asynchronous metadata replication and a vector-clock implementation for collision resolution. To cope with the Big Velocity characteristic of Big Data, our last contribution features DStore, an in-memory document-oriented store that scales vertically by leveraging the large memory capacity of multicore machines. DStore demonstrates fast and atomic complex transaction processing for data writes, while maintaining high-throughput read access. DStore follows a single-threaded execution model to execute update transactions sequentially, while relying on versioning-based concurrency control to enable a large number of simultaneous readers.
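
The combination described at the end of the abstract above (sequential update transactions plus versioning so that readers are never blocked) can be illustrated with a toy sketch. This is not DStore's implementation: the class `VersionedStore` and its methods are hypothetical, and the copy-on-write of the whole dictionary is a simplification chosen only to keep the example short.

```python
import threading


class VersionedStore:
    """Toy key-value store: writers publish immutable snapshots,
    readers use whichever snapshot was current when they started."""

    def __init__(self):
        # One attribute holds (version, data) so that a reader gets a
        # consistent pair with a single attribute read.
        self._current = (0, {})
        # Serialises update transactions (one writer at a time).
        self._write_lock = threading.Lock()

    def snapshot(self):
        """Return (version, data); data must be treated as read-only."""
        return self._current

    def read(self, key, default=None):
        # No lock is taken on the read path.
        _, data = self._current
        return data.get(key, default)

    def update(self, changes):
        """Apply all changes atomically and return the new version."""
        with self._write_lock:
            version, data = self._current
            new_data = dict(data)      # copy-on-write: old snapshot untouched
            new_data.update(changes)
            self._current = (version + 1, new_data)
            return version + 1


store = VersionedStore()
old_version, old_data = store.snapshot()   # a reader starts here
new_version = store.update({"a": 1, "b": 2})
```

After the update, `old_data` is still the empty dictionary the earlier reader saw, while new reads observe both keys at once; that is the isolation property the sketch is meant to show. A production system would presumably avoid copying the full data set on every write.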

39

Zebbiche, K. "Data Hiding for Securing Fingerprint Data Access." Thesis, Queen's University Belfast, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.517622.

40

Jägerhult, Fjelberg Marianne. "Predicting data traffic in cellular data networks." Thesis, KTH, Matematisk statistik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-169388.

Abstract:

The exponential increase in cellular data usage in recent times is evident, and it introduces challenges and opportunities for the telecom industry. From a Radio Resource Management perspective, it is therefore most valuable to be able to predict future events such as user load. The objective of this thesis is thus to investigate whether one can predict such future events based on information available in a base station. This is done by clustering data obtained from a simulated 4G network using Gaussian Mixture Models. Based on this, an evaluation based on the cluster signatures is performed, where heavy-load users seem to be identified. Furthermore, evaluations of other temporal aspects tied to the clusters and cluster transitions are performed. Secondly, supervised classification using Random Forest is performed, in order to investigate whether prediction of these cluster labels is possible. High accuracies for most of these classifications are obtained, suggesting that prediction based on these methods can be made.

41

Ross, Colin. "Applications of data fusion in data approximation." Thesis, University of Huddersfield, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.247372.

42

Conlin, Adrian Keith. "Complex sensor data analysis through data augmentation." Thesis, University of Newcastle Upon Tyne, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.320016.

43

Cao, Yang. "Querying big data with bounded data access." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/25421.

Abstract:

Query answering over big data is cost-prohibitive. A linear scan of a dataset D may take days with a solid state device if D is of PB size and years if D is of EB size. In other words, polynomial-time (PTIME) algorithms for query evaluation are already not feasible on big data. To tackle this, we propose querying big data with bounded data access, such that the cost of query evaluation is independent of the scale of D. First of all, we propose a class of boundedly evaluable queries. A query Q is boundedly evaluable under a set A of access constraints if for any dataset D that satisfies the constraints in A, there exists a subset DQ ⊆ D such that (a) Q(DQ) = Q(D), and (b) the time for identifying DQ from D, and hence the size |DQ| of DQ, are independent of |D|. That is, we can compute Q(D) by accessing a bounded amount of data no matter how big D grows. We study the problem of deciding whether a query is boundedly evaluable under A. It is known that the problem is undecidable for FO without access constraints. We show that, in the presence of access constraints, it is decidable in 2EXPSPACE for positive fragments of FO queries, but is already EXPSPACE-hard even for CQ. To handle the undecidability and high complexity of the analysis, we develop effective syntax for boundedly evaluable queries under A, referred to as queries covered by A, such that (a) any boundedly evaluable query under A is equivalent to a query covered by A, (b) each covered query is boundedly evaluable, and (c) it is efficient to decide whether Q is covered by A. On top of DBMS, we develop practical algorithms for checking whether queries are covered by A, and generating bounded plans if so. For queries that are not boundedly evaluable, we extend bounded evaluability to resource-bounded approximation and bounded query rewriting using views. 
(1) Resource-bounded approximation is parameterized with a resource ratio a ∈ (0,1], such that for any query Q and dataset D, it computes approximate answers with an accuracy bound h by accessing at most a|D| tuples. It is based on extended access constraints and a new accuracy measure. (2) Bounded query rewriting tackles the problem by incorporating bounded evaluability with views, such that the queries can be exactly answered by accessing cached views and a bounded amount of data in D. We study the problem of deciding whether a query has a bounded rewriting, establish its complexity bounds, and develop effective syntax for FO queries with a bounded rewriting. Finally, we extend bounded evaluability to graph pattern queries, by extending access constraints to graph data. We characterize bounded evaluability for subgraph and simulation patterns and develop practical algorithms for associated problems.
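
The notion of bounded evaluability in the abstract above can be illustrated with a small sketch. It is not taken from the thesis: the relations, the query and the two access constraints (each person has a bounded number of friends, and `pid` is a key of the person relation, both served by an index) are hypothetical. The point is only that the query is answered through index lookups, so the number of tuples touched depends on the constraints and not on the size of the dataset.

```python
from collections import defaultdict


class IndexedRelation:
    """Relation with a hash index on one attribute and an access counter."""

    def __init__(self, rows, key):
        self.index = defaultdict(list)
        for row in rows:
            self.index[row[key]].append(row)
        self.accessed = 0  # number of tuples fetched through the index

    def fetch(self, value):
        rows = self.index.get(value, [])
        self.accessed += len(rows)
        return rows


def friends_in_city(friend, person, pid, city):
    """Names of pid's friends living in `city`, using index lookups only."""
    names = []
    for f in friend.fetch(pid):             # bounded by the friend constraint
        for p in person.fetch(f["fid"]):    # pid is a key: at most one tuple
            if p["city"] == city:
                names.append(p["name"])
    return sorted(names)


# Hypothetical dataset: 10,000 people, each with one friend, and two
# extra friends for person 0.
n = 10_000
people = [
    {"pid": i, "name": f"p{i}", "city": "NYC" if i % 2 == 0 else "LA"}
    for i in range(n)
]
friends = [{"pid": i, "fid": (i + 1) % n} for i in range(n)]
friends += [{"pid": 0, "fid": 2}, {"pid": 0, "fid": 4}]

friend_rel = IndexedRelation(friends, key="pid")
person_rel = IndexedRelation(people, key="pid")
result = friends_in_city(friend_rel, person_rel, pid=0, city="NYC")
touched = friend_rel.accessed + person_rel.accessed
```

Here `touched` counts three friend tuples and three person tuples, and it would stay the same if `n` were increased, whereas a linear scan would grow with the dataset. Building the indexes is of course not free; the sketch treats them as already available, as the access constraints in the abstract do.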

44

Mueller, G. "Data Consistency Checks on Flight Test Data." International Foundation for Telemetering, 2014. http://hdl.handle.net/10150/577405.

Abstract:

ITC/USA 2014 Conference Proceedings / The Fiftieth Annual International Telemetering Conference and Technical Exhibition / October 20-23, 2014 / Town and Country Resort & Convention Center, San Diego, CA
This paper reflects the principal results of a study performed internally by Airbus's flight test centers. The purpose of this study was to share the body of knowledge concerning data consistency checks between all Airbus business units. An analysis of the test process is followed by the identification of the process stakeholders involved in ensuring data consistency. In the main part of the paper several different possibilities for improving data consistency are listed; it is left to the discretion of the reader to determine the appropriateness of these methods.

45

Fitzgerald, Alan. "DATA STORAGE SUITED TO FLIGHT DATA RECORDERS." International Foundation for Telemetering, 2004. http://hdl.handle.net/10150/605266.

46

Oladele, Kazeem Ayinde. "Investigating pluralistic data architectures in data warehousing." Thesis, Brunel University, 2015. http://bura.brunel.ac.uk/handle/2438/10534.

Abstract:

Understanding and managing change is a strategic objective for many organisations to successfully compete in a marketplace; as a result, organisations are leveraging their data assets and implementing data warehouses to gain the business intelligence necessary to improve their businesses. Data warehouses are expensive initiatives, and one-half to two-thirds of most data warehousing efforts end in failure. In the absence of a well-formalised design methodology in the industry and in the context of the debate on data architecture in data warehousing, this thesis examines why multidimensional and relational data models define the data architecture landscape in the industry. The study develops a number of propositions from the literature and empirical data to understand the factors impacting the choice of logical data model in data warehousing. Using a comparative case study method as the means of collecting empirical data from the case organisations, the research proposes a conceptual model for logical data model adoption. The model provides a framework that guides decision making for adopting a logical data model for a data warehouse. The research conceptual model identifies the characteristics of business requirements and decision pathways for multidimensional and relational data warehouses. The conceptual model adds value by identifying the business requirements for which a multidimensional or relational logical data model is empirically applicable.

47

Peralta, Costabel Veronika del Carmen. "Data quality evaluation in data integration systems." Versailles-St Quentin en Yvelines, 2006. http://www.theses.fr/2006VERS0020.

Abstract:

This thesis deals with data quality evaluation in Data Integration Systems (DIS). Specifically, we address the problems of evaluating the quality of the data conveyed to users in response to their queries and verifying whether users’ quality expectations can be achieved. We also analyze how quality measures can be used for improving the design of the DIS and, consequently, data quality. Our approach consists in studying one quality factor at a time, analyzing its impact within a DIS, proposing techniques for its evaluation and proposing improvement actions for its enforcement. Among the quality factors that have been proposed, this thesis analyzes two of the most used ones: data freshness and data accuracy.

48

Zhang, Yihua. "ON DATA UTILITY IN PRIVATE DATA PUBLISHING." Miami University / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=miami1272986770.

49

Modur, Sharada P. "Missing Data Methods for Clustered Longitudinal Data." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1274876785.

50

Zha, Xiao. "Topological Data Analysis on Road Network Data." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu155563664988436.
