Dissertations / Theses on the topic 'Big data with missingness'

Consult the top 50 dissertations / theses for your research on the topic 'Big data with missingness.'

1

Cao, Yu. "Bayesian nonparametric analysis of longitudinal data with non-ignorable non-monotone missingness." VCU Scholars Compass, 2019. https://scholarscompass.vcu.edu/etd/5750.

Full text
Abstract:
In longitudinal studies, outcomes are measured repeatedly over time, but in practice clinical studies are full of missing data points of both monotone and non-monotone nature. Often this missingness is related to the unobserved data, so that it is non-ignorable. In this context, the pattern-mixture model (PMM) is a popular tool for analyzing the joint distribution of outcomes and missingness patterns: the unobserved outcomes are imputed using the distribution of observed outcomes, conditioned on missingness patterns. However, existing methods suffer from model identification issues if data are sparse in specific missingness patterns, which is likely to happen with a small sample size or a large number of repeated measurements. We extend the existing methods using latent class analysis (LCA) and a shared-parameter PMM. The LCA groups patterns of missingness with similar features, and the shared-parameter PMM allows a subset of parameters to differ among latent classes when fitting a model, thus restoring model identifiability. A novel imputation method is also developed using the distribution of observed data conditioned on latent classes. We develop this model for continuous response data and extend it to handle ordinal rating-scale data. Our model performs better than existing methods for data with small sample sizes. The method is applied to two datasets: one from a phase II clinical trial studying the quality of life of patients with prostate cancer receiving radiation therapy, and another studying the relationship between perceived neighborhood conditions in adolescence and drinking habits in adulthood.
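The pattern-conditional imputation idea can be illustrated with a small sketch outside the Bayesian nonparametric setting described above (an assumption-laden illustration, not the author's model): missingness patterns are clustered into a handful of latent classes with k-means, and each missing entry is then imputed by drawing from the observed values of the same variable within the same class. All data and names below are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy longitudinal data: 200 subjects, 5 repeated measurements, with holes.
Y = rng.normal(loc=np.arange(5), scale=1.0, size=(200, 5))
Y[rng.random(Y.shape) < 0.3] = np.nan

# Encode each subject's missingness pattern as a 0/1 vector and
# group similar patterns into a small number of latent classes.
R = np.isnan(Y).astype(float)
classes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(R)

# Impute each missing value by drawing from the observed values of the
# same time point within the same latent class (falling back to the
# overall observed values if a class has nothing observed there).
Y_imp = Y.copy()
for c in np.unique(classes):
    in_class = classes == c
    for t in range(Y.shape[1]):
        observed = Y[in_class, t][~np.isnan(Y[in_class, t])]
        if observed.size == 0:
            observed = Y[:, t][~np.isnan(Y[:, t])]
        missing = in_class & np.isnan(Y[:, t])
        Y_imp[missing, t] = rng.choice(observed, size=missing.sum())

print("remaining NaNs:", np.isnan(Y_imp).sum())
```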
2

Hansen, Simon, and Erik Markow. "Big Data : Implementation av Big Data i offentlig verksamhet." Thesis, Högskolan i Halmstad, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-38756.

Full text
3

Deng, Wei. "Multiple imputation for marginal and mixed models in longitudinal data with informative missingness." Connect to resource, 2005. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1126890027.

Full text
Abstract:
Thesis (Ph. D.)--Ohio State University, 2005.
Title from first page of PDF file. Document formatted into pages; contains xiii, 108 p.; also includes graphics. Includes bibliographical references (p. 104-108). Available online via OhioLINK's ETD Center
4

Lundvall, Helena. "Big data = Big money? : En kvantitativ studie om big data, förtroende och köp online." Thesis, Uppsala universitet, Företagsekonomiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-451065.

Full text
Abstract:
Previous research has consistently shown that increased customer trust in purchase situations increases customers' willingness to complete a purchase. The factors that influence customer trust have also been studied extensively, and factors related to the handling of customer data are increasingly cited as decisive. However, these factors are often treated only at a general level, and studies that dig into the underlying data-handling factors affecting customer trust are lacking. By collecting quantitative data on how customers relate to companies' collection and use of big data, on their trust in e-commerce companies, and on their willingness to make purchases online, this study aims to examine the effect of companies' collection and use of big data on customers' trust in e-commerce companies, and to examine the effect of customers' trust on their willingness to purchase. The results show that companies' collection of big data has a significant negative effect on customer trust, and that customer trust has a significant positive relationship with purchase intention. For companies' use of big data, however, no significant negative effect on customer trust could be demonstrated.
5

Rizk, Raya. "Big Data Validation." Thesis, Uppsala universitet, Informationssystem, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-353850.

Full text
Abstract:
With the explosion in the usage of big data, stakes are high for companies to develop workflows that translate the data into business value. Those data transformations are continuously updated and refined in order to meet evolving business needs, and it is imperative to ensure that a new version of a workflow still produces the correct output. This study focuses on the validation of big data in a real-world scenario and implements a validation tool that compares two databases holding the results produced by different versions of a workflow, in order to detect and prevent potential unwanted alterations. Row-based and column-based statistics are used to validate the two versions. The tool was shown to provide accurate results in test scenarios, providing leverage to companies that need to validate the outputs of their workflows. In addition, by automating this process, the risk of human error is eliminated, with the added benefit of improved speed compared to the more labour-intensive manual alternative. All this allows for a more agile way of updating data transformation workflows by improving the turnaround time of the validation process.
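As a hedged illustration of the row- and column-level checks described above (not the thesis's actual tool), the sketch below compares two versions of a workflow output loaded as pandas DataFrames; the table layout, key column and statistics chosen are assumptions.

```python
import pandas as pd

def validate(old: pd.DataFrame, new: pd.DataFrame, key: str) -> dict:
    """Compare two versions of a workflow output with simple
    row-based and column-based statistics."""
    report = {
        "row_count_old": len(old),
        "row_count_new": len(new),
        "keys_only_in_old": len(set(old[key]) - set(new[key])),
        "keys_only_in_new": len(set(new[key]) - set(old[key])),
        "column_stats": {},
    }
    for col in old.columns.intersection(new.columns):
        if pd.api.types.is_numeric_dtype(old[col]):
            report["column_stats"][col] = {
                "mean_diff": float(new[col].mean() - old[col].mean()),
                "null_diff": int(new[col].isna().sum() - old[col].isna().sum()),
            }
    return report

# Hypothetical outputs of two workflow versions.
v1 = pd.DataFrame({"id": [1, 2, 3], "revenue": [10.0, 20.0, 30.0]})
v2 = pd.DataFrame({"id": [1, 2, 4], "revenue": [10.0, 21.0, 35.0]})
print(validate(v1, v2, key="id"))
```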
6

Jaber, Carolin. "Big data visualisering." Thesis, Örebro universitet, Institutionen för naturvetenskap och teknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-79898.

Full text
Abstract:
Presenting data in graphical form is important in many different fields in order to understand the information and relationships in the data being collected. The amount of data is growing fast, to scales that are hard to handle, which brings new challenges for visualizing data in graphical presentations. Systems depend on data visualization to detect defects and faults in production; improving the performance of time-series data visualization increases the ability to detect such faults and defects. This report covers methods for visualizing time-series data with fast performance and discusses how multivariate big data can be visualized with PCA.
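The PCA idea can be sketched as follows; this is a generic illustration with synthetic data, not the thesis's implementation. A multivariate time series is standardized and projected onto its first two principal components so that many correlated channels can be inspected at once.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Toy multivariate time series: 1000 time steps, 12 correlated channels.
t = np.linspace(0, 20, 1000)
base = np.sin(t)[:, None]
X = base @ rng.normal(size=(1, 12)) + 0.1 * rng.normal(size=(1000, 12))

# Standardize the channels and project onto the first two principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("first few component scores:\n", scores[:3])
```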
7

Blahová, Leontýna. "Big Data Governance." Master's thesis, Vysoká škola ekonomická v Praze, 2016. http://www.nusl.cz/ntk/nusl-203994.

Full text
Abstract:
This master's thesis is about Big Data Governance and the software used for this purpose. Because Big Data is both a huge opportunity and a risk, I wanted to map products that can easily be used for Data Quality and Big Data Governance in one platform. The thesis does not stay at the level of theoretical knowledge; it also evaluates five key products (from my point of view). I defined requirements for each domain and then set weights and points. The main objective is to evaluate the capabilities of the software products and compare them.
8

Kämpe, Gabriella. "How Big Data Affects User Experience : Reducing cognitive load in big data applications." Thesis, Umeå universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163995.

Full text
Abstract:
We have entered the age of big data. Massive data sets are common in enterprises, government, and academia, yet interpreting data at such scales is still hard for the human mind. This thesis investigates how proper design can decrease the cognitive load in data-heavy applications. It focuses on numeric data describing economic growth in retail organizations and aims to answer two questions: What is important to keep in mind when designing an interface that holds large amounts of data? And how can the cognitive load in complex user interfaces be decreased without reducing functionality? It answers these questions by comparing two user interfaces in terms of efficiency, structure, ease of use and navigation. Each interface holds the same functionality and amount of data, but one is designed to improve the user experience by reducing cognitive load. The design choices in the second application are based on the theory presented in the literature study of the thesis.
9

Hafez, Mai. "Analysis of multivariate longitudinal categorical data subject to nonrandom missingness : a latent variable approach." Thesis, London School of Economics and Political Science (University of London), 2015. http://etheses.lse.ac.uk/3184/.

Full text
Abstract:
Longitudinal data are collected for studying changes across time. In social sciences, interest is often in theoretical constructs, such as attitudes, behaviour or abilities, which cannot be directly measured. In that case, multiple related manifest (observed) variables, for example survey questions or items in an ability test, are used as indicators for the constructs, which are themselves treated as latent (unobserved) variables. In this thesis, multivariate longitudinal data are considered where multiple observed variables, measured at each time point, are used as indicators for theoretical constructs (latent variables) of interest. The observed items and the latent variables are linked together via statistical latent variable models. A common problem in longitudinal studies is missing data, where missingness can be classified into one of two forms. Dropout occurs when subjects exit the study prematurely, while intermittent missingness takes place when subjects miss one or more occasions but show up on a subsequent wave of the study. Ignoring the missingness mechanism can lead to biased estimates, especially when the missingness is nonrandom. The approach proposed in this thesis uses latent variable models to capture the evolution of a latent phenomenon over time, while incorporating a missingness mechanism to account for possibly nonrandom forms of missingness. Two model specifications are presented: the first incorporates only dropout in the missingness mechanism, while the other accounts for both dropout and intermittent missingness, allowing them to be informative by modelling them as functions of the latent variables and possibly of observed covariates. The models developed in this thesis consider ordinal and binary observed items, because such variables are often met in social surveys, while the underlying latent variables are assumed to be continuous. The proposed models are illustrated by analysing people's perceptions of women's work using three questions from five waves of the British Household Panel Survey.
10

Andersson, Oscar, and Tim Andersson. "AI applications on healthcare data." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-44752.

Full text
Abstract:
The purpose of this research is to get a better understanding of how different machine learning algorithms perform under different amounts of data corruption. This is important since data corruption is a pervasive issue in data collection and thus, by extension, in any work that relies on the collected data. The questions we looked at were: Which feature is the most important? How significant is the correlation between features? Which algorithms should be used given the data available? And how much noise (inaccurately or unhelpfully captured data) is acceptable? The study is structured to introduce AI in healthcare, data missingness, and the machine learning algorithms we used. In the method section, we give a recommended workflow for handling data with machine learning in mind. The results show that when a dataset is filled with random values, the run-time of algorithms increases since many patterns are lost. Randomly removing values also caused less of a problem than first anticipated, since we ran multiple trials, evening out any problems caused by the lost values. Lastly, imputation is a preferred way of handling missing data since it retains much of the dataset's structure, although one has to keep in mind whether the imputation is done on categorical or numerical values. However, there is no easy best fit for any dataset, and it is hard to give a concrete answer when choosing a machine learning algorithm. Nevertheless, since it is easy to plug and play with many algorithms, we would recommend users to try different ones before deciding which fits a project best.
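A minimal sketch of the kind of experiment described, under assumed settings (the dataset, corruption rate and classifier are stand-ins, not the authors' setup): values are removed at random from the training data, and a classifier trained after mean imputation is compared with one trained after dropping incomplete rows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Corrupt 5 % of the training values to simulate missing data.
X_missing = X_train.copy()
X_missing[rng.random(X_missing.shape) < 0.05] = np.nan

# Strategy 1: impute missing values with the column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)
clf_impute = RandomForestClassifier(random_state=0).fit(X_imputed, y_train)

# Strategy 2: drop every row that contains a missing value.
keep = ~np.isnan(X_missing).any(axis=1)
clf_drop = RandomForestClassifier(random_state=0).fit(X_missing[keep], y_train[keep])

print("accuracy with imputation:  ", accuracy_score(y_test, clf_impute.predict(X_test)))
print("accuracy with row dropping:", accuracy_score(y_test, clf_drop.predict(X_test)))
```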
11

Sherikar, Vishnu Vardhan Reddy. "I2MAPREDUCE: DATA MINING FOR BIG DATA." CSUSB ScholarWorks, 2017. https://scholarworks.lib.csusb.edu/etd/437.

Full text
Abstract:
This project is an extension of "i2MapReduce: Incremental MapReduce for Mining Evolving Big Data". i2MapReduce is used for incremental big data processing; it uses a fine-grained incremental engine and a general-purpose iterative model that supports iterative algorithms such as PageRank, Fuzzy C-Means (FCM), Generalized Iterated Matrix-Vector Multiplication (GIM-V) and Single-Source Shortest Path (SSSP). The main purpose of this project is to reduce input/output overhead, avoid the cost of re-computation and avoid stale data mining results. Finally, the performance of i2MapReduce is analyzed by comparing the resulting graphs.
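To make the iterative-MapReduce idea concrete, here is a single-machine sketch of one PageRank iteration expressed as map and reduce steps; it is illustrative only, since i2MapReduce's contribution is the fine-grained incremental re-computation layered on top of such iterations. The toy graph is invented.

```python
from collections import defaultdict

# Toy web graph: node -> outgoing links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = {node: 1.0 / len(graph) for node in graph}
DAMPING = 0.85

def map_step(node, rank):
    """Emit a rank contribution for every outgoing link of one node."""
    for target in graph[node]:
        yield target, rank / len(graph[node])

def reduce_step(node, contributions):
    """Combine all contributions received by a node into its new rank."""
    return (1 - DAMPING) / len(graph) + DAMPING * sum(contributions)

for _ in range(20):  # one MapReduce job per iteration
    grouped = defaultdict(list)
    for node, rank in ranks.items():
        for target, contrib in map_step(node, rank):
            grouped[target].append(contrib)
    ranks = {node: reduce_step(node, grouped.get(node, [])) for node in graph}

print(ranks)
```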
12

Giordano, Manfredi. "Autonomic Big Data Processing." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/14837/.

Full text
Abstract:
Apache Spark is an open-source framework for large-scale distributed computing, characterised by an in-memory engine that delivers better performance than competing solutions when processing data at rest (batch) or in motion (streaming). In this work we present several techniques designed and implemented to improve the elasticity and adaptability of the framework with respect to dynamic changes in the execution environment or in the workload. The primary goal of these techniques is to let concurrent applications share the physical resources of the underlying cluster infrastructure efficiently. The context in which distributed applications run can hardly be considered static: hardware components can fail, processes can stop, and users may allocate additional resources unpredictably in an attempt to speed up computation or lighten the workload. Moreover, not only the physical resources but also the input data can vary in size and complexity during execution, so neither data nor resources can be considered static. An immutable cluster configuration will not achieve the best possible efficiency for all the different workloads. It follows that a distributed computing framework that is aware of changes in the environment and in the workload, and is able to adapt to them, can perform better than a framework that only allows static configurations. Our experiments with highly parallelisable Big Data applications show that the cost of the proposed solution is minimal and that our more dynamic and adaptive version of Spark can bring benefits in terms of flexibility, scalability and efficiency.
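One mechanism already present in stock Spark that works in the same spirit as the elasticity techniques described is dynamic executor allocation. The snippet below is a generic PySpark configuration sketch of that built-in feature, not the modified Spark version developed in the thesis; the application name and executor bounds are arbitrary.

```python
from pyspark.sql import SparkSession

# Generic settings that let an application grow and shrink its executor pool
# with the workload (a cluster manager with an external shuffle service or
# shuffle tracking is assumed).
spark = (
    SparkSession.builder
    .appName("elastic-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)

print(spark.sparkContext.getConf().get("spark.dynamicAllocation.enabled"))
```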
13

Francke, Angela, and Sven Lißner. "Big Data im Radverkehr." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2018. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-230730.

Full text
Abstract:
Attractive cycling requires high-quality infrastructure. Because of the high effort involved in on-site counts, cycling volumes have so far only been available for individual points. At present, the most reliable and usable figures come from permanently installed automatic bicycle counting stations, which many municipalities have already set up. One disadvantage is that the number of counting points is usually far too small to cover an entire city or municipality with real explanatory power, so the importance of the secondary network for cycling is captured only incompletely. For other parameters, such as waiting times, route choice or cyclists' speeds, data are usually missing altogether. In the future this gap can be filled by GPS route data, among other sources, which has become possible through the now very widespread use of smartphones and the corresponding tracking apps. The results of the project presented in this guide were funded by the BMVI (Federal Ministry of Transport and Digital Infrastructure) within the framework of the National Cycling Plan 2020. The research project investigates the usability of app-based user data generated with smartphones for municipal bicycle traffic planning. In summary, provided the factors described in the guide are taken into account, GPS data (in this case data from Strava Inc.) can be used for bicycle traffic planning with some restrictions. Even today, analyses are possible that show where, when and how cyclists move throughout the entire network. The data generated via smartphone app can very usefully supplement municipalities' existing permanent counting stations. When evaluating and interpreting the data, however, some aspects should be considered, such as the rather sport-oriented context of the recorded routes in the examples examined. Furthermore, some of the data are currently still provided as database or GIS files, while online interfaces for easier use are under construction or in a first stage of use. Evaluation and interpretation therefore still require specialist expertise and staff resources, although this effort is likely to decrease in the future as web interfaces and supporting evaluation tools are developed further. In cooperation with the municipalities, the required parameters and the most suitable forms of data preparation will have to be worked out. Within the research project, an approach for extrapolating cycling volumes from samples of GPS data to the entire network was developed and successfully verified in a further municipality; nevertheless, further research and adaptation to local conditions are still needed. In the near future, practical proof of the usability of GPS data must be provided. The cities of Bremen, Dresden, Leipzig and Mainz, which have each already taken first steps towards using GPS data in bicycle traffic planning and promotion, can serve as models. Against the background of the continuing digitalisation of mobility and transport and the resulting growth in available data, these steps make sense despite the current limitations of the data, so that administrations can build up the corresponding competences early on. In the long term, the use of GPS data offers added value for bicycle traffic planning.
Actively involving cyclists also opens up new opportunities for communication and citizen participation, without requiring specialist knowledge. This guide provides a practice-oriented introduction to the topic and gives a comprehensive overview of the offerings, obstacles and potential of GPS data.
14

Santos, Lúcio Fernandes Dutra. "Similaridade em big data." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-07022018-104929/.

Full text
Abstract:
The data being collected and generated nowadays increase not only in volume but also in complexity, requiring new query operators. Health care centers collecting image exams and remote sensing from satellites and earth-based stations are examples of application domains where more powerful and flexible operators are required. Storing, retrieving and analyzing data that are huge in volume, structure, complexity and distribution is now referred to as big data. Representing and querying big data using only the traditional scalar data types is no longer enough. Similarity queries are the most pursued resource to retrieve complex data, but until recently they were not available in database management systems. Now that they are starting to become available, their first uses in real systems make it clear that the basic similarity query operators are not enough to meet the requirements of the target applications. The main reason is that similarity is a concept usually formulated considering only small numbers of data elements. Nowadays, researchers target handling big data mainly through parallel architectures, and only a few studies address the efficacy of the query answers. This Ph.D. work develops variations of the basic similarity operators that are better suited to handle big data, presenting a more holistic vision of the database and increasing the effectiveness of the provided answers, without a considerable impact on the efficiency of the search algorithms and while enabling scalable execution over large volumes of data. To achieve this goal, four main contributions are presented. The first is a result-diversification model that can be applied with any comparison criterion and similarity search operator. The second defines sampling and grouping techniques based on the proposed diversification model, aiming at speeding up the analysis of result sets. The third contribution concentrates on evaluation methods for measuring the quality of diversified result sets. Finally, the last one defines an approach to integrate the concepts of visual data mining and similarity-with-diversity searches in content-based retrieval systems, allowing a better understanding of how the diversity property is applied in the query process.
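A hedged sketch of the result-diversification idea, using a generic greedy max-min strategy rather than the specific model proposed in the thesis: starting from the nearest neighbours of a query, results are picked so that each new element balances closeness to the query against distance from the results already chosen. The data, trade-off parameter and function names are illustrative.

```python
import numpy as np

def diversified_knn(data, query, k=5, candidates=50, trade_off=0.5):
    """Greedy similarity search with diversity: each pick balances
    closeness to the query against distance to already selected items."""
    dist_to_query = np.linalg.norm(data - query, axis=1)
    pool = list(np.argsort(dist_to_query)[:candidates])
    selected = [pool.pop(0)]                     # start from the nearest neighbour
    while len(selected) < k and pool:
        best, best_score = None, -np.inf
        for idx in pool:
            d_sel = min(np.linalg.norm(data[idx] - data[j]) for j in selected)
            score = -trade_off * dist_to_query[idx] + (1 - trade_off) * d_sel
            if score > best_score:
                best, best_score = idx, score
        selected.append(best)
        pool.remove(best)
    return selected

rng = np.random.default_rng(3)
points = rng.normal(size=(1000, 8))
print(diversified_knn(points, query=np.zeros(8)))
```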
15

Francke, Angela, and Sven Lißner. "Big Data im Radverkehr." Technische Universität Dresden, 2017. https://tud.qucosa.de/id/qucosa%3A29637.

Full text
Abstract:
Attractive cycling requires high-quality infrastructure. Because of the high effort involved in on-site counts, cycling volumes have so far only been available for individual points. At present, the most reliable and usable figures come from permanently installed automatic bicycle counting stations, which many municipalities have already set up. One disadvantage is that the number of counting points is usually far too small to cover an entire city or municipality with real explanatory power, so the importance of the secondary network for cycling is captured only incompletely. For other parameters, such as waiting times, route choice or cyclists' speeds, data are usually missing altogether. In the future this gap can be filled by GPS route data, among other sources, which has become possible through the now very widespread use of smartphones and the corresponding tracking apps. The results of the project presented in this guide were funded by the BMVI (Federal Ministry of Transport and Digital Infrastructure) within the framework of the National Cycling Plan 2020. The research project investigates the usability of app-based user data generated with smartphones for municipal bicycle traffic planning. In summary, provided the factors described in the guide are taken into account, GPS data (in this case data from Strava Inc.) can be used for bicycle traffic planning with some restrictions. Even today, analyses are possible that show where, when and how cyclists move throughout the entire network. The data generated via smartphone app can very usefully supplement municipalities' existing permanent counting stations. When evaluating and interpreting the data, however, some aspects should be considered, such as the rather sport-oriented context of the recorded routes in the examples examined. Furthermore, some of the data are currently still provided as database or GIS files, while online interfaces for easier use are under construction or in a first stage of use. Evaluation and interpretation therefore still require specialist expertise and staff resources, although this effort is likely to decrease in the future as web interfaces and supporting evaluation tools are developed further. In cooperation with the municipalities, the required parameters and the most suitable forms of data preparation will have to be worked out. Within the research project, an approach for extrapolating cycling volumes from samples of GPS data to the entire network was developed and successfully verified in a further municipality; nevertheless, further research and adaptation to local conditions are still needed. In the near future, practical proof of the usability of GPS data must be provided. The cities of Bremen, Dresden, Leipzig and Mainz, which have each already taken first steps towards using GPS data in bicycle traffic planning and promotion, can serve as models. Against the background of the continuing digitalisation of mobility and transport and the resulting growth in available data, these steps make sense despite the current limitations of the data, so that administrations can build up the corresponding competences early on. In the long term, the use of GPS data offers added value for bicycle traffic planning.
Actively involving cyclists also opens up new opportunities for communication and citizen participation, without requiring specialist knowledge. This guide provides a practice-oriented introduction to the topic and gives a comprehensive overview of the offerings, obstacles and potential of GPS data.
16

Виноградова, О. В. "Використання Big Data компаніями." Thesis, Київський національний універститет технологій та дизайну, 2017. https://er.knutd.edu.ua/handle/123456789/10417.

Full text
17

Blaho, Matúš. "Aplikace pro Big Data." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2018. http://www.nusl.cz/ntk/nusl-385977.

Full text
Abstract:
This work deals with the description and analysis of the Big Data concept and its processing and use in decision support. The suggested processing is based on the MapReduce concept designed for Big Data processing. The theoretical part of the work largely concerns the Hadoop system, which implements this concept; understanding it is key to properly designing applications that run within it. The work also contains designs for specific Big Data processing applications. The implementation part of the thesis describes Hadoop system management, the implementation of the MapReduce applications, and their testing over data sets.
18

Flike, Felix, and Markus Gervard. "BIG DATA-ANALYS INOM FOTBOLLSORGANISATIONER En studie om big data-analys och värdeskapande." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20117.

Full text
Abstract:
Big data is a relatively new term, but the phenomenon has existed for a long time. It can be described using five V's: volume, veracity, variety, velocity and value. Big data analysis has proved valuable to organisations for decision-making, for generating measurable economic benefits and for improving operations. In the sports industry it began to be used in earnest in the early 2000s by the baseball organisation Oakland Athletics, which started recruiting players based on their statistics rather than on scouts' assessments of their ability, with great success. More organisations followed, and before long big data analysis was being used in all major sports to gain advantages over competitors. In the Swedish context, the use of these tools is still relatively new, and many organisations may have moved too quickly in implementing them. Based on a case analysis, this study examines how football organisations work with big data analysis related to their players. The results show that both organisations create value from their investments and benefit from it in pursuing their strategic goals, but they do so in different ways. Which approach is most effective in terms of value creation cannot be answered by this study.
19

Sánchez, Adam. "Big Data, Linked Data y Web semántica." Universidad Peruana de Ciencias Aplicadas (UPC), 2016. http://hdl.handle.net/10757/620705.

Full text
Abstract:
Lecture given as part of Open Access Week Peru, held from 24 to 26 October 2016 in Lima, Peru. The organising institutions were the Universidad Peruana de Ciencias Aplicadas (UPC), the Pontificia Universidad Católica del Perú (PUCP) and the Universidad Peruana Cayetano Heredia (UPCH).
The lecture addresses aspects of the Linked Data approach, Big Data and the Semantic Web.
20

Nyström, Simon, and Joakim Lönnegren. "Processing data sources with big data frameworks." Thesis, KTH, Data- och elektroteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-188204.

Full text
Abstract:
Big data is a concept that is expanding rapidly. As more and more data is generated and garnered, there is an increasing need for efficient solutions that can process all this data in attempts to gain value from it. The purpose of this thesis is to find an efficient way to quickly process a large number of relatively small files. More specifically, the purpose is to test two frameworks that can be used for big data processing: Apache NiFi and Apache Storm. A method is devised to, firstly, construct a data flow and, secondly, test the performance and scalability of the frameworks running this data flow. The results reveal that Apache Storm is faster than Apache NiFi at the sort of task that was tested. As the number of nodes included in the tests went up, the performance did not always follow. This indicates that adding more nodes to a big data processing pipeline does not always result in a better-performing setup and that, sometimes, other measures must be taken to improve performance.
21

Tran, Viet-Trung. "Scalable data-management systems for Big Data." Phd thesis, École normale supérieure de Cachan - ENS Cachan, 2013. http://tel.archives-ouvertes.fr/tel-00920432.

Full text
Abstract:
Big Data can be characterized by 3 V's. * Big Volume refers to the unprecedented growth in the amount of data. * Big Velocity refers to the growth in the speed of moving data into and out of management systems. * Big Variety refers to the growth in the number of different data formats. Managing Big Data requires fundamental changes in the architecture of data management systems. Data storage needs to keep evolving in order to adapt to the growth of data: it must be scalable while maintaining high performance for data access. This thesis focuses on building scalable data management systems for Big Data. Our first and second contributions address the challenge of providing efficient support for Big Volume of data in data-intensive high performance computing (HPC) environments. In particular, we address the shortcoming of existing approaches in handling atomic, non-contiguous I/O operations in a scalable fashion. We propose and implement a versioning-based mechanism that can be leveraged to offer isolation for non-contiguous I/O without the need to perform expensive synchronizations. In the context of parallel array processing in HPC, we introduce Pyramid, a large-scale, array-oriented storage system. It revisits the physical organization of data in distributed storage systems for scalable performance. Pyramid favors multidimensional-aware data chunking that closely matches the access patterns generated by applications. Pyramid also favors distributed metadata management and versioning concurrency control to eliminate synchronization under concurrency. Our third contribution addresses Big Volume at the scale of geographically distributed environments. We consider BlobSeer, a distributed versioning-oriented data management service, and we propose BlobSeer-WAN, an extension of BlobSeer optimized for such geographically distributed environments. BlobSeer-WAN takes the latency hierarchy into account by favoring local metadata accesses. BlobSeer-WAN features asynchronous metadata replication and a vector-clock implementation for collision resolution. To cope with the Big Velocity characteristic of Big Data, our last contribution features DStore, an in-memory document-oriented store that scales vertically by leveraging the large memory capacity of multicore machines. DStore demonstrates fast and atomic complex transaction processing for data writes, while maintaining high-throughput read access. DStore follows a single-threaded execution model to execute update transactions sequentially, while relying on versioning concurrency control to enable a large number of simultaneous readers.
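The versioning-based concurrency idea behind DStore can be sketched generically; the toy class below only illustrates the principle (a single writer installs immutable snapshots so that readers never block and always see a consistent version) and is not the actual system.

```python
import threading

class VersionedStore:
    """Single-writer, multi-reader store: each update builds a new immutable
    snapshot; readers always work against the version current when they ask."""

    def __init__(self):
        self._snapshot = {}                   # current immutable version
        self._write_lock = threading.Lock()   # serialises writers only

    def update(self, changes: dict):
        with self._write_lock:                        # single-threaded write path
            new_version = dict(self._snapshot)        # copy-on-write
            new_version.update(changes)
            self._snapshot = new_version              # atomic reference swap

    def read(self):
        return self._snapshot                 # readers never take a lock

store = VersionedStore()
store.update({"user:1": {"name": "Ada"}})
view = store.read()                           # consistent snapshot
store.update({"user:2": {"name": "Alan"}})
print(view)                                   # still the old version
print(store.read())                           # includes both users
```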
22

Cao, Yang. "Querying big data with bounded data access." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/25421.

Full text
Abstract:
Query answering over big data is cost-prohibitive. A linear scan of a dataset D may take days with a solid-state device if D is of PB size and years if D is of EB size. In other words, polynomial-time (PTIME) algorithms for query evaluation are already not feasible on big data. To tackle this, we propose querying big data with bounded data access, such that the cost of query evaluation is independent of the scale of D. First of all, we propose a class of boundedly evaluable queries. A query Q is boundedly evaluable under a set A of access constraints if for any dataset D that satisfies the constraints in A, there exists a subset DQ ⊆ D such that (a) Q(DQ) = Q(D), and (b) the time for identifying DQ from D, and hence the size |DQ| of DQ, are independent of |D|. That is, we can compute Q(D) by accessing a bounded amount of data no matter how big D grows. We study the problem of deciding whether a query is boundedly evaluable under A. It is known that the problem is undecidable for FO without access constraints. We show that, in the presence of access constraints, it is decidable in 2EXPSPACE for positive fragments of FO queries, but is already EXPSPACE-hard even for CQ. To handle the undecidability and high complexity of the analysis, we develop an effective syntax for boundedly evaluable queries under A, referred to as queries covered by A, such that (a) any boundedly evaluable query under A is equivalent to a query covered by A, (b) each covered query is boundedly evaluable, and (c) it is efficient to decide whether Q is covered by A. On top of a DBMS, we develop practical algorithms for checking whether queries are covered by A, and for generating bounded plans if so. For queries that are not boundedly evaluable, we extend bounded evaluability to resource-bounded approximation and bounded query rewriting using views. (1) Resource-bounded approximation is parameterized with a resource ratio a ∈ (0,1], such that for any query Q and dataset D, it computes approximate answers with an accuracy bound h by accessing at most a|D| tuples. It is based on extended access constraints and a new accuracy measure. (2) Bounded query rewriting tackles the problem by incorporating bounded evaluability with views, such that queries can be answered exactly by accessing cached views and a bounded amount of data in D. We study the problem of deciding whether a query has a bounded rewriting, establish its complexity bounds, and develop an effective syntax for FO queries with a bounded rewriting. Finally, we extend bounded evaluability to graph pattern queries, by extending access constraints to graph data. We characterize bounded evaluability for subgraph and simulation patterns and develop practical algorithms for the associated problems.
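A toy illustration of bounded evaluability, not the thesis's algorithms: under an access constraint stating that order_id is a key backed by an index, a selection query on order_id values can be answered by fetching a constant number of tuples, independent of |D|. The relation and attribute names are invented.

```python
# A dataset D of orders and an access constraint: order_id is a key,
# so an index lookup returns at most one tuple per value.
D = [{"order_id": i, "customer": f"c{i % 100}", "total": i * 1.5}
     for i in range(1_000_000)]
index_on_order_id = {t["order_id"]: t for t in D}  # realises the access constraint

def query_bounded(order_ids):
    """Answer SELECT * FROM orders WHERE order_id IN (...) by accessing
    only len(order_ids) tuples of D, independent of the size of D."""
    return [index_on_order_id[o] for o in order_ids if o in index_on_order_id]

print(query_bounded([7, 42, 999_999]))   # three index accesses, no scan of D
```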
23

Al-Hashemi, Idrees Yousef. "Applying data mining techniques over big data." Thesis, Boston University, 2013. https://hdl.handle.net/2144/21119.

Full text
Abstract:
Thesis (M.S.C.S.)--Boston University, 2013. PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis, so it is not openly accessible, though it may be available by request.
The rapid development of information technology in recent decades means that data appear in a wide variety of formats — sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 Petabytes stored in the world in 2000. Today's internet holds about 0.1 Zettabytes of data (a ZB is about 10^21 bytes), and this number will reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data—in today's parlance, Big Data. In the present study, we show the basic concepts and design of Big Data tools, algorithms, and techniques. We compare classical data mining algorithms to Big Data algorithms by using Hadoop/MapReduce as a core implementation of Big Data for scalable algorithms. We implemented the K-means algorithm and the A-priori algorithm with Hadoop/MapReduce on a 5-node Hadoop cluster. We explore NoSQL databases for semi-structured, massively large-scale data by using MongoDB as an example. Finally, we compare the performance of HDFS (Hadoop Distributed File System) and MongoDB data storage for these two algorithms.
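One K-means iteration maps naturally onto MapReduce, which is the kind of reformulation the study describes; the sketch below runs the map and reduce phases in plain Python on synthetic data and is illustrative rather than the actual Hadoop jobs.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(7)
points = rng.normal(size=(10_000, 2))
centroids = points[rng.choice(len(points), size=3, replace=False)]

def map_phase(point, centroids):
    """Emit (nearest centroid id, (point, 1)) for one input record."""
    cid = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
    return cid, (point, 1)

def reduce_phase(values):
    """Average all points assigned to one centroid."""
    total = np.sum([p for p, _ in values], axis=0)
    count = sum(n for _, n in values)
    return total / count

for _ in range(10):                       # one MapReduce job per iteration
    grouped = defaultdict(list)
    for p in points:
        cid, value = map_phase(p, centroids)
        grouped[cid].append(value)
    centroids = np.array([reduce_phase(grouped[cid]) for cid in sorted(grouped)])

print(centroids)
```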
24

Erlandsson, Niklas. "Game Analytics och Big Data." Thesis, Mittuniversitetet, Avdelningen för arkiv- och datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-29185.

Full text
Abstract:
Game Analytics is a research field that has appeared recently. Game developers have the ability to analyze how customers use their products down to every button pressed. This can result in large amounts of data, and the challenge is to make sense of it all. The challenges with game data are often described with the same characteristics used to define Big Data: volume, velocity and variability. This should mean that there is potential for a fruitful collaboration. The purpose of this study is to analyze and evaluate what possibilities Big Data offers to develop the Game Analytics field. To fulfil this purpose, a literature review and semi-structured interviews with people active in the gaming industry were conducted. The results show that the sources agree that valuable information can be found in the data that can be stored, especially in the monetary, general and core values of the specific game. With more advanced analysis other interesting patterns may be found as well, but the predominant approach is to stick to the simple variables and stay away from digging deeper. This is not because data handling or storage would be tedious or too difficult, but because the analysis would be too risky an investment. Even with someone ready to take on all the challenges game data poses, there is not enough trust in the answers or in how useful they might be. Visions of the future within the field are modest, and the near future seems to hold mostly efficiency improvements and a widening of the field, making it reach more people. This does not really pose any new demands on the data handling.
25

Francke, Angela, and Sven Lißner. "Big Data in Bicycle Traffic." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2018. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-233278.

Full text
Abstract:
For cycling to be attractive, the infrastructure must be of high quality. Due to the high level of resources required to record it locally, the available data on the volume of cycling traffic has to date been patchy. At the moment, the most reliable and usable numbers seem to be derived from permanently installed automatic cycling traffic counters, already used by many local authorities. One disadvantage of these is that the number of data collection points is generally far too low to cover the entirety of a city or other municipality in a way that achieves truly meaningful results. The effect of side roads on cycling traffic is therefore only incompletely assessed. Furthermore, there is usually no data at all on other parameters, such as waiting times, route choices and cyclists’ speed. This gap might in future be filled by methods such as GPS route data, as is now possible by today’s widespread use of smartphones and the relevant tracking apps. The results of the project presented in this guide have been supported by the BMVI [Federal Ministry of Transport and Digital Infrastructure] within the framework of its 2020 National Cycling Plan. This research project seeks to investigate the usability of user data generated using a smartphone app for bicycle traffic planning by local authorities. In summary, it can be stated that, taking into account the factors described in this guide, GPS data are usable for bicycle traffic planning within certain limitations. (The GPS data evaluated in this case were provided by Strava Inc.) Nowadays it is already possible to assess where, when and how cyclists are moving around across the entire network. The data generated by the smartphone app could be most useful to local authorities as a supplement to existing permanent traffic counters. However, there are a few aspects that need to be considered when evaluating and interpreting the data, such as the rather fitness-oriented context of the routes surveyed in the examples examined. Moreover, some of the data is still provided as database or GIS files, although some online templates that are easier to use are being set up, and some can already be used in a basic initial form. This means that evaluation and interpretation still require specialist expertise as well as human resources. However, the need for these is expected to reduce in the future with the further development of web interfaces and supporting evaluation templates. For this to work, developers need to collaborate with local authorities to work out what parameters are needed as well as the most suitable formats. This research project carried out an approach to extrapolating cycling traffic volumes from random samples of GPS data over the whole network. This was also successfully verified in another municipality. Further research is still nevertheless required in the future, as well as adaptation to the needs of different localities. Evidence for the usability of GPS data in practice still needs to be acquired in the near future. The cities of Dresden, Leipzig and Mainz could be taken as examples for this, as they have all already taken their first steps in the use of GPS data in planning for and supporting cycling. These steps make sense in the light of the increasing digitisation of traffic and transport and the growing amount of data available as a result – despite the limitations on these data to date – so that administrative bodies can start early in building up the appropriate skills among their staff. 
The use of GPS data would yield benefits for bicycle traffic planning in the long run. In addition, the active involvement of cyclists opens up new possibilities in communication and citizen participation – even without requiring specialist knowledge. This guide delivers a practical introduction to the topic, giving a comprehensive overview of the opportunities, obstacles and potential offered by GPS data.
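The extrapolation approach mentioned above can be illustrated with a hedged toy calculation (not the project's actual model): at links with permanent counters, a scaling factor from GPS-tracked rides to total rides is estimated and then applied to GPS volumes on links without counters. All numbers and link names are invented.

```python
import pandas as pd

# Links where both a permanent counter and GPS-based volumes are available.
calibration = pd.DataFrame({
    "link": ["A", "B", "C"],
    "counter_daily": [1200, 800, 450],   # rides counted by the installed station
    "gps_daily": [90, 55, 30],           # rides seen in the GPS sample
})
scaling_factor = (calibration["counter_daily"] / calibration["gps_daily"]).mean()

# Links covered only by GPS data: extrapolate total daily rides.
network = pd.DataFrame({"link": ["D", "E"], "gps_daily": [12, 70]})
network["estimated_daily"] = (network["gps_daily"] * scaling_factor).round()

print(f"scaling factor: {scaling_factor:.1f}")
print(network)
```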
26

Doucet, Rachel A., Deyan M. Dontchev, Javon S. Burden, and Thomas L. Skoff. "Big data analytics test bed." Thesis, Monterey, California: Naval Postgraduate School, 2013. http://hdl.handle.net/10945/37615.

Full text
Abstract:
Approved for public release; distribution is unlimited
The proliferation of big data has significantly expanded the quantity and breadth of information throughout the DoD. The task of processing and analyzing this data has become difficult, if not infeasible, using traditional relational databases. The Navy has a growing priority for information processing, exploitation, and dissemination, which makes use of the vast network of sensors that produce a large amount of big data. This capstone report explores the feasibility of a scalable Tactical Cloud architecture that will harness and utilize the underlying open-source tools for big data analytics. A virtualized cloud environment was built and analyzed at the Naval Postgraduate School, offering a test bed suitable for studying novel variations of these architectures. Further, the technologies used to implement the test bed demonstrate a sustainable methodology for rapidly configuring and deploying virtualized machines and provide an environment for performance benchmarking and testing. The capstone findings indicate strategies and best practices to automate the deployment, provisioning and management of big data clusters. The functionality we seek to support is a far more general goal: finding open-source tools that help to deploy and configure large clusters for on-demand big data analytics.
27

Lansley, Guy David. "Big data : geodemographics and representation." Thesis, University College London (University of London), 2018. http://discovery.ucl.ac.uk/10045119/.

Full text
Abstract:
Due to the harmonisation of data collection procedures with everyday activities, Big Data can be harnessed to produce geodemographic representations that supplement or even replace traditional sources of population data, which suffer from low response rates or intermittent refreshes. Furthermore, the velocity and diversity of new forms of data also enable the creation of entirely new forms of geodemographic insight. However, their miscellaneous data collection procedures are inconsistent, unregulated and not robustly sampled like conventional social science data sources. Therefore, uncertainty is inherent when attempting to glean representative research on the population at large from Big Data. All data are of partial coverage; however, the provenance of Big Data is poorly understood. Consequently, the use of such data has epistemologically shifted how geographers build representations of the population. In repurposing Big Data, researchers may encounter a variety of data types that are not readily suitable for quantitative analysis and may represent geodemographic phenomena only indirectly. Furthermore, while there are considerable barriers to acquiring data pertaining to people and their actions, it is also challenging to link Big Data. In light of this, this work explores the fundamental challenges of using geospatial Big Data to represent the population and their activities across space and time. These are demonstrated through original research on several big datasets, including Consumer Registers (which comprise public versions of the Electoral Register and consumer data), Driver and Vehicle Licensing Agency (DVLA) car registration data, and geotagged Twitter posts. While this thesis is critical of Big Data, it remains optimistic about their potential value and demonstrates techniques through which uncertainty can be identified or mitigated to an extent. In the process it also exemplifies how new forms of data can produce geodemographic insight that was previously unobservable on a large scale.
28

Cao, Lei. "Outlier Detection In Big Data." Digital WPI, 2016. https://digitalcommons.wpi.edu/etd-dissertations/82.

Full text
Abstract:
The dissertation focuses on scaling outlier detection to work both on huge static and on dynamic streaming datasets. Outliers are patterns in the data that do not conform to the expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention and network intrusion detection to stock investment tactical planning. For such mission-critical applications, a timely response is often of paramount importance, yet processing outlier detection requests has high algorithmic complexity and is resource consuming. In this dissertation we investigate the challenges of detecting outliers in big data, in particular those caused by the high velocity of streaming data, the big volume of static data and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to assure the responsiveness of outlier detection in big data. We first propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high-volume, continuously evolving streaming data. LEAP encompasses two general optimization principles that utilize the rarity of the outliers and the temporal priority relationships among stream data points. Leveraging these two principles, LEAP not only continuously delivers outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with various parameter settings. Second, we develop a distributed approach to efficiently detect outliers over massive-scale static data sets. In this big data era, as the volume of data advances to new levels, the power of distributed compute clusters must be employed to detect outliers in a short turnaround time. In this research, our approach optimizes the key factors determining the efficiency of distributed data analytics, namely communication costs and load balancing. In particular, we prove that the traditional frequency-based load balancing assumption is not effective. We thus design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional approach of using one detection algorithm for all compute nodes and instead propose a novel multi-tactic methodology which adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it. Third, traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to home in on the most appropriate parameter setting or on the desired results. We thus design an interactive outlier exploration paradigm that is not only able to answer traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools to assist analysts to quickly extract, interpret and understand the outliers of interest. Our experimental studies, including performance evaluations and user studies conducted on real-world stock, sensor, moving-object and geolocation datasets, confirm both the effectiveness and efficiency of the proposed approaches.
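As a generic sketch of one popular streaming outlier model that such frameworks target (a distance-threshold model over a sliding window, not LEAP itself): a point is flagged as an outlier if fewer than k other points in the current window lie within radius r. The window size, radius and k below are arbitrary, and the data are synthetic.

```python
from collections import deque
import numpy as np

def streaming_outliers(stream, window=200, r=0.5, k=3):
    """Distance-based outliers over a sliding window: a point is an outlier
    if fewer than k other points in the window lie within radius r of it."""
    window_buf = deque(maxlen=window)
    for i, x in enumerate(stream):
        window_buf.append(x)
        pts = np.array(window_buf)
        neighbours = np.sum(np.linalg.norm(pts - x, axis=1) <= r) - 1  # exclude x itself
        if neighbours < k and len(window_buf) > k:
            yield i, x

rng = np.random.default_rng(5)
data = rng.normal(size=(1000, 2))
data[::250] += 6.0                      # inject a few obvious outliers
for idx, point in streaming_outliers(data):
    print(f"outlier at position {idx}: {point}")
```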
APA, Harvard, Vancouver, ISO, and other styles
29

Talbot, David. "Bloom maps for big data." Thesis, University of Edinburgh, 2010. http://hdl.handle.net/1842/25235.

Full text
Abstract:
The ability to retrieve a value given a key is fundamental in computer science. Unfortunately as the a priori set from which keys are drawn grows in size, any exact data structure must use more space per key. This motivates our interest in approximate data structures. We consider the problem of succinctly encoding a map to support queries with bounded error when the distribution over values is known. We give a lower bound on the space required per key in terms of the entropy of the distribution over values and the error rate and present a generalization of the Bloom filter, the Bloom map, that achieves the lower bound up to a small constant factor. We then develop static and on-line approximation schemes for frequency data that use constant space per key to store frequencies with bounded relative error when these follow a power law. Our on-line construction has constant expected update complexity per observation and requires only a single pass over a data set. Finally we present a simple framework for using a priori knowledge to reduce the error rate of an approximate data structure with one-sided error. We evaluate the data structures proposed here empirically and use them to construct randomized language models that significantly reduce the space requirements of a state-of-the-art statistical machine translation system.
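As a point of reference for readers unfamiliar with the underlying data structure, here is a toy sketch (not the thesis's Bloom map, with arbitrary hash and size choices) of the Bloom filter it generalises, showing the one-sided error the abstract refers to: membership queries may return false positives but never false negatives. The Bloom map extends the same idea from set membership to key-value lookup, with space per key close to the entropy of the value distribution.

    import hashlib

    class BloomFilter:
        # Toy Bloom filter: k hash positions per key over a bit array of m bits.
        def __init__(self, m=1024, k=4):
            self.m, self.k, self.bits = m, k, bytearray(m)

        def _positions(self, key):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p] = 1

        def __contains__(self, key):
            # One-sided error: never a false negative, occasionally a false positive.
            return all(self.bits[p] for p in self._positions(key))

    bf = BloomFilter()
    bf.add("apple")
    print("apple" in bf, "pear" in bf)   # True, almost certainly False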
APA, Harvard, Vancouver, ISO, and other styles
30

Rupprecht, Lukas. "Network-aware big data processing." Thesis, Imperial College London, 2017. http://hdl.handle.net/10044/1/52455.

Full text
Abstract:
The scale-out approach of modern data-parallel frameworks such as Apache Flink or Apache Spark has enabled them to deal with large amounts of data. These applications are often deployed in large-scale data centres with many resources. However, as deployments and data continue to grow, more network communication is incurred during a data processing query. At the same time, data centre networks (DCNs) are becoming increasingly more complex in terms of the physical network topology, the variety of applications that are sharing the network, and the different requirements of these applications on the network. The high complexity of DCNs combined with the increased traffic demands of applications has made the network a bottleneck for query performance. In this thesis, we explore ways of making data-parallel frameworks network-aware, i.e. we combine specific knowledge about the application and the physical network to reduce query completion times. We identify three main types of traffic that occur during query processing and add network-awareness to each of them to optimise network usage. 1) Traffic reduction for aggregatable traffic exploits the physical network topology and the associativity and commutativity of aggregation queries to reduce traffic as early as possible. In-network aggregation trees utilise existing networking hardware and the tree topology of DCNs to partially aggregate and thereby reduce data as it flows through the network. 2) Traffic balancing for non-aggregatable traffic monitors the network throughput of an application and uses knowledge about the query to optimise the overall network utilisation. By dynamically changing the destinations of parts of the transferred data, network hotspots, which can occur when many applications share the network, can be avoided. 3) Traffic elimination for storage traffic gives control over data placement to the application instead of the distributed storage system. This allows the application to optimise where data is stored across the cluster based on application properties and thereby eliminate unnecessary network traffic.
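The first optimisation relies on the aggregation function being associative and commutative, so that partial results can be merged at intermediate network hops instead of shipping every raw record to a single reducer. A minimal sketch of that idea (a generic sum, not tied to NetAgg, Flink, Spark or any particular switch hardware):

    from functools import reduce

    def aggregate_at_node(children_partials, local_records, combine=lambda a, b: a + b):
        # Combine a node's local partial aggregate with the partial aggregates
        # received from its children, forwarding a single value upstream.
        local = reduce(combine, local_records, 0)
        return reduce(combine, children_partials, local)

    # Two leaf racks pre-aggregate their own records; the intermediate node merges
    # the two partials, so only two values (not every record) cross the core links.
    leaf_a = aggregate_at_node([], [3, 5, 7])
    leaf_b = aggregate_at_node([], [2, 2])
    root = aggregate_at_node([leaf_a, leaf_b], [])
    print(root)   # 19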
APA, Harvard, Vancouver, ISO, and other styles
31

Andersson, Andreas. "Big data - det nya hälsoverktyget?" Thesis, Linnéuniversitetet, Institutionen för idrottsvetenskap (ID), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-56519.

Full text
Abstract:
An insight into a new and rapidly growing field. A study that examines the use of data collection and Big Data in health companies. The purpose is to create knowledge and awareness of how today's health companies operate in this area. Through a review of nine companies' terms of use and privacy policies, we find that all of the companies collect and store data about their customers. The collection takes place without the user's knowledge, and this Big Data is then shared with other companies that have use for it.
APA, Harvard, Vancouver, ISO, and other styles
32

Слишинська, В. О., and Ігор Віталійович Пономаренко. "Використання Big Data в маркетингу." Thesis, КНУТД, 2016. https://er.knutd.edu.ua/handle/123456789/4082.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Панферова, И. Ю. "Анализ неструктурированных данных big data." Thesis, Академія внутрішніх військ МВС України, 2017. http://openarchive.nure.ua/handle/document/9973.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Luo, Changqing. "Towards Secure Big Data Computing." Case Western Reserve University School of Graduate Studies / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=case1529929603348119.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Šoltýs, Matej. "Big Data v technológiách IBM." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-193914.

Full text
Abstract:
This diploma thesis presents Big Data technologies and their possible use cases and applications. The theoretical part first defines the term Big Data and then focuses on Big Data technology, particularly the Hadoop framework. It describes the principles of Hadoop, such as distributed storage and data processing, and its individual components. The largest vendors of Big Data technologies are also presented. The end of this part describes possible use cases of Big Data technologies as well as some case studies. The practical part describes the implementation of a demo example of Big Data technologies and is divided into two chapters. The first chapter of the practical part deals with the conceptual design of the demo example, the products used and the architecture of the solution. The second chapter then describes the implementation of the demo example, from preparation of the demo environment to creation of the applications. The goals of this thesis are the description and characteristics of Big Data, a presentation of the largest vendors and their Big Data products, a description of possible use cases of Big Data technologies and, above all, the implementation of a demo example in Big Data tools from IBM.
APA, Harvard, Vancouver, ISO, and other styles
36

Miloš, Marek. "Nástroje pro Big Data Analytics." Master's thesis, Vysoká škola ekonomická v Praze, 2013. http://www.nusl.cz/ntk/nusl-199274.

Full text
Abstract:
The thesis covers Big Data, a term for a specific kind of data analysis. It first defines the term Big Data and explains why the concept arose from the growing need for deeper data-processing and analysis tools and methods. The thesis also covers some of the technical aspects of Big Data tools, focusing on Apache Hadoop in detail. The later chapters contain a Big Data market analysis and describe the biggest Big Data competitors and tools. The practical part of the thesis presents a way of using Apache Hadoop to analyse data from Twitter; the results are then visualized in Tableau.
APA, Harvard, Vancouver, ISO, and other styles
37

Al-Salim, Ali Mahdi Ali. "Energy efficient big data networks." Thesis, University of Leeds, 2018. http://etheses.whiterose.ac.uk/20640/.

Full text
Abstract:
The continuous increase of big data applications in number and type creates new challenges that should be tackled by the green ICT community. Data scientists classify big data into four main categories (4Vs): Volume (with direct implications on power needs), Velocity (with impact on delay requirements), Variety (with varying CPU requirements and reduction ratios after processing) and Veracity (with cleansing and backup constraints). Each V poses many challenges that confront the energy efficiency of the underlying networks carrying big data traffic. In this work, we investigated the impact of the big data 4Vs on energy efficient bypass IP over WDM networks. The investigation is carried out by developing Mixed Integer Linear Programming (MILP) models that encapsulate the distinctive features of each V. In our analyses, the big data network is greened by progressively processing big data raw traffic at strategic locations, dubbed processing nodes (PNs), built into the network along the path from big data sources to the data centres. At each PN, raw data is processed and lower-rate useful information is extracted progressively, eventually reducing the network power consumption. For each V, we conducted an in-depth analysis and evaluated the network power saving that can be achieved by the energy efficient big data network compared to the classical approach. Along the volume dimension of big data, the work dealt with optimally handling and processing an enormous number of big data Chunks and extracting the corresponding knowledge carried by those Chunks, transmitting knowledge instead of data and thus reducing the data volume and saving power. Variety means that there are different types of big data, such as CPU intensive, memory intensive, Input/Output (IO) intensive, CPU-memory intensive, CPU-IO intensive, and memory-IO intensive applications. Each type requires a different amount of processing, memory, storage, and networking resources. The processing of different varieties of big data was optimised with the goal of minimising power consumption. In the velocity dimension, we classified the processing velocity of big data into two modes: an expedited-data processing mode and a relaxed-data processing mode. Expedited data demands a larger amount of computational resources to reduce the execution time compared to relaxed data. The big data processing and transmission were optimised given the velocity dimension to reduce power consumption. Veracity specifies trustworthiness, data protection, data backup, and data cleansing constraints. We considered the implementation of data cleansing and backup operations prior to big data processing so that big data is cleansed and readied for entering the big data analytics stage. The analysis was carried out through dedicated scenarios considering the influence of each V's characteristic parameters. For the set of network parameters we considered, our results for network energy efficiency under the volume, variety, velocity and veracity scenarios revealed that network power savings of up to 52%, 47%, 60% and 58%, respectively, can be achieved by the energy efficient big data network approach compared to the classical approach.
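The abstract's power savings come from MILP models; the intuition behind processing at PNs can nonetheless be shown with a back-of-the-envelope sketch. The volume, hop count, reduction ratio and per-bit energy below are entirely hypothetical, not values from the thesis, and only transport energy is counted (the thesis's models also account for processing power at PNs and data centres).

    def transport_energy(volume_bits, hops, joules_per_bit_hop):
        # Energy spent moving a given volume of data across a number of hops.
        return volume_bits * hops * joules_per_bit_hop

    raw_bits, reduction_ratio, j_per_bit_hop = 1e12, 0.1, 1e-9
    # Classical approach: ship raw data across all 5 hops to the data centre.
    classical = transport_energy(raw_bits, 5, j_per_bit_hop)
    # PN approach: process after the first hop, then ship only the extracted 10%.
    pn_based = transport_energy(raw_bits, 1, j_per_bit_hop) \
             + transport_energy(raw_bits * reduction_ratio, 4, j_per_bit_hop)
    print(classical, pn_based, 1 - pn_based / classical)   # 5000.0 1400.0 0.72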
APA, Harvard, Vancouver, ISO, and other styles
38

Neagu, Daniel, and A.-N. Richarz. "Big data in predictive toxicology." Royal Society of Chemistry, 2019. http://hdl.handle.net/10454/17603.

Full text
Abstract:
The rate at which toxicological data is generated is continually becoming more rapid, and the volume of data generated is growing dramatically. This is due in part to advances in software solutions and cheminformatics approaches which increase the availability of open data from chemical, biological, toxicological and high-throughput screening resources. However, the amplified pace and capacity of data generation achieved by these novel techniques present challenges for organising and analysing data output. Big Data in Predictive Toxicology discusses these challenges as well as the opportunities of new techniques encountered in data science. It addresses the nature of toxicological big data, their storage, analysis and interpretation. It also details how these data can be applied in toxicity prediction, modelling and risk assessment.
APA, Harvard, Vancouver, ISO, and other styles
39

Potter, Justin Gregory. "Big data adoption in SMMEs." Diss., University of Pretoria, 2015. http://hdl.handle.net/2263/52297.

Full text
Abstract:
Big data and the use of big data analytics are being adopted more frequently, especially in large organisations that have the resources to deploy them. Big data analytics is allowing businesses to optimise operations and gain deeper insights into their customers' needs and behaviours. There is, however, almost no published research into how big data analytics is being used by SMMEs and how they are doing this despite having constrained resources. The objective of this research was to explore the factors that contribute to the adoption of big data analytics in SMMEs. Nine qualitative, semi-structured interviews using the long interview method were conducted with respondents who worked in senior management positions in SMMEs that were using some form of big data analytics. Eight of these respondents were at EXCO level, seven had some level of business ownership and five were the Managing Directors of the organisation. The study found the use of evidence in decision-making and entrepreneurial orientation to be present. These organisations are both proactive and innovative, and limit their risk by the use of experimentation. This provides insights into how these companies develop novel business models through the use of cloud services and by providing the ability to digest and analyse data on their clients' behalf. A framework is proposed and adapted. Suggestions for future research and limitations of the study are presented.
Mini Dissertation (MBA)--University of Pretoria, 2015.
Gordon Institute of Business Science (GIBS)
APA, Harvard, Vancouver, ISO, and other styles
40

Mai, Luo. "Towards efficient big data processing in data centres." Thesis, Imperial College London, 2017. http://hdl.handle.net/10044/1/64817.

Full text
Abstract:
Large data processing systems require a high degree of coordination, and exhibit network bottlenecks due to massive communication data. This motivates my PhD study to propose system control mechanisms that improve monitoring and coordination, and efficient communication methods by bridging applications and networks. The first result is Chi, a new control plane for stateful streaming systems. Chi has a control loop that embeds control messages in data channels to seamlessly monitor and coordinate a streaming pipeline. This design helps monitor system and application-specific metrics in a scalable manner, and perform complex modification with on-the-fly data. The behaviours of control messages are customisable, thus enabling various control algorithms. Chi has been deployed into production systems, and exhibits high performance and scalability in test-bed experiments. With effective coordination, data-intensive systems need to remove network bottlenecks. This is important in data centres as their networks are usually over-subscribed. Hence, my study explores an idea that bridges applications and networks for accelerating communication. This idea can be realised (i) in the network core through a middlebox platform called NetAgg that can efficiently execute application-specific aggregation functions along busy network paths, and (ii) at network edges through a server network stack that provides powerful communication primitives and traffic management services. Test-bed experiments show that these methods can improve the communication of important analytics systems. A tight integration of applications and networks, however, requires an intuitive network programming model. My study thus proposes a network programming framework named Flick. Flick has a high-level programming language for application-specific network services. The services are compiled to dataflows and executed by a high-performance runtime. To be production-friendly, this runtime can run in commodity network elements and guarantee fair resource sharing among services. Flick has been used for developing popular network services, and its performance is shown in real-world benchmarks.
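Chi's central mechanism, control messages travelling through the same channels as data so that they reach operators in stream order, can be illustrated with a toy sketch. The message types and operator below are hypothetical, not the actual Chi API.

    from dataclasses import dataclass

    @dataclass
    class Data:
        payload: int

    @dataclass
    class Control:
        action: str   # e.g. "report-metrics"

    def operator(stream):
        # Process data records, react to control messages in arrival order,
        # and forward every message downstream along the same channel.
        count = 0
        for msg in stream:
            if isinstance(msg, Control) and msg.action == "report-metrics":
                print("records seen so far:", count)
            elif isinstance(msg, Data):
                count += 1
            yield msg

    channel = [Data(1), Data(2), Control("report-metrics"), Data(3)]
    list(operator(iter(channel)))   # prints: records seen so far: 2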
APA, Harvard, Vancouver, ISO, and other styles
41

Chitondo, Pepukayi David Junior. "Data policies for big health data and personal health data." Thesis, Cape Peninsula University of Technology, 2016. http://hdl.handle.net/20.500.11838/2479.

Full text
Abstract:
Thesis (MTech (Information Technology))--Cape Peninsula University of Technology, 2016.
Health information policies are constantly becoming a key feature in directing information usage in healthcare. After the passing of the Health Information Technology for Economic and Clinical Health (HITECH) Act in 2009 and the Affordable Care Act (ACA) in 2010 in the United States, there has been an increase in health systems innovations. Coupled with this health systems hype is the current buzz concept in Information Technology, 'Big data'. The prospects of big data are full of potential, even more so in the healthcare field where the accuracy of data is life critical. How big health data can be used to achieve improved health is now the goal of the current health informatics practitioner. Even more exciting is the amount of health data being generated by patients via personal handheld devices and other forms of technology that exclude the healthcare practitioner. This patient-generated data is also known as Personal Health Records, PHR. To achieve meaningful use of PHRs and healthcare data in general through big data, a couple of hurdles have to be overcome. First and foremost is the issue of privacy and confidentiality of the patients whose data is concerned. Second is the perceived trustworthiness of PHRs by healthcare practitioners. Other issues to consider are data rights and ownership, data suppression, IP protection, data anonymisation and re-identification, information flow and regulations, as well as consent biases. This study sought to understand the role of data policies in the process of data utilisation in the healthcare sector, with added interest in PHR utilisation as part of big health data.
APA, Harvard, Vancouver, ISO, and other styles
42

Kalibjian, Jeff. ""Big Data" Management and Security Application to Telemetry Data Products." International Foundation for Telemetering, 2013. http://hdl.handle.net/10150/579664.

Full text
Abstract:
ITC/USA 2013 Conference Proceedings / The Forty-Ninth Annual International Telemetering Conference and Technical Exhibition / October 21-24, 2013 / Bally's Hotel & Convention Center, Las Vegas, NV
"Big Data" [1] and the security challenge of managing "Big Data" is a hot topic in the IT world. The term "Big Data" is used to describe very large data sets that cannot be processed by traditional database applications in "tractable" periods of time. Securing data in a conventional database is challenge enough; securing data whose size may exceed hundreds of terabytes or even petabytes is even more daunting! As the size of telemetry product and telemetry post-processed product continues to grow, "Big Data" management techniques and the securing of that data may have ever increasing application in the telemetry realm. After reviewing "Big Data", "Big Data" security and management basics, potential application to telemetry post-processed product will be explored.
APA, Harvard, Vancouver, ISO, and other styles
43

Grohsschmiedt, Steffen. "Making Big Data Smaller : Reducing the storage requirements for big data with erasure coding for Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177201.

Full text
Abstract:
The amount of data stored in modern data centres is growing rapidly. Large-scale distributed file systems, which maintain the massive data sets in data centres, are designed to work with commodity hardware. Due to the quality and quantity of the hardware components in such systems, failures are considered normal events and, as such, distributed file systems are designed to be highly fault-tolerant. A common approach to achieve fault tolerance is using redundancy by storing three copies of a file across different storage nodes, thereby increasing the storage requirements by a factor of three and further aggravating the storage problem. A concrete implementation of such a file system is the Hadoop Distributed File System (HDFS). This thesis explores the use of RAID-like mechanisms in order to decrease the storage requirements for big data. We designed and implemented a prototype that extends HDFS with a simple but powerful erasure coding API. Compared to existing approaches, we decided to locate the erasure-coding management logic in the HDFS NameNode, as this allows us to use internal HDFS APIs and state. Because of that, we can repair failures associated with erasure-coded files more quickly and at lower cost. We evaluate our prototype, and we also show that the use of erasure coding instead of replication can greatly decrease the storage requirements of big data without sacrificing reliability and availability. Finally, we argue that our API can support a large range of custom encoding strategies, while adding the erasure coding logic to the NameNode can significantly improve the management of the encoded files.
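For intuition about the storage saving, here is a toy single-parity example (not the prototype's API; production deployments typically use Reed-Solomon codes, which this sketch does not implement). With k data blocks and one XOR parity block, any one lost block can be rebuilt from the survivors, at a storage overhead of (k+1)/k instead of the 3x of triple replication.

    def xor_blocks(blocks):
        # XOR equal-length blocks byte by byte.
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data_blocks = [b"aaaa", b"bbbb", b"cccc", b"dddd"]   # k = 4 data blocks
    parity = xor_blocks(data_blocks)                     # 5 blocks stored: 1.25x overhead

    # Rebuild a single lost block from the surviving blocks plus the parity block.
    lost = 2
    survivors = [b for i, b in enumerate(data_blocks) if i != lost]
    assert xor_blocks(survivors + [parity]) == data_blocks[lost]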
APA, Harvard, Vancouver, ISO, and other styles
44

Rystadius, Gustaf, David Monell, and Linus Mautner. "The dynamic management revolution of Big Data : A case study of Åhlen’s Big Data Analytics operation." Thesis, Jönköping University, Internationella Handelshögskolan, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-48959.

Full text
Abstract:
Background: The implementation of Big Data Analytics (BDA) has increased drastically within several sectors, such as retailing. Due to its rapidly altering environment, companies have to adapt and modify their business strategies and models accordingly. The concepts of ambidexterity and agility are said to act as mediators of these changes in relation to a company's capabilities within BDA. Problem: Research within the respective fields of dynamic mediators and BDAC has been conducted, but the investigation of specific traits of these mediators, their interconnection and their impact on BDAC is scant. Scholars have found this surprising and have called for further empirical investigation. Purpose: This paper sought to empirically investigate which specific traits of ambidexterity and agility emerged within the case company Åhlen's BDA operation, and how these traits are interconnected. It further studied how these traits and their interplay impact the firm's talent and managerial BDAC. Method: A qualitative case study of the retail firm Åhlens was conducted with three participants central to the firm's BDA operation. Semi-structured interviews were conducted with questions derived from a conceptual framework based on the reviewed literature and pilot interviews. The data were then analyzed and matched to the literature using a thematic analysis approach. Results: Five ambidextrous traits and three agile traits were found within Åhlen's BDA operation. Analysis of these traits showed a clear positive impact on Åhlen's BDAC when they were properly interconnected. Further, it was found that in the absence of such interplay, the dynamic mediators did not have as positive an impact and occasionally even had disruptive effects on the firm's BDAC. Hence, it was concluded that a proper connection between the mediators has to be present in order to successfully impact and enhance the capabilities.
APA, Harvard, Vancouver, ISO, and other styles
45

Serra-Diaz, Josep M., Brian J. Enquist, Brian Maitner, Cory Merow, and Jens-C. Svenning. "Big data of tree species distributions: how big and how good?" SPRINGER HEIDELBERG, 2018. http://hdl.handle.net/10150/626611.

Full text
Abstract:
Background: Trees play crucial roles in the biosphere and societies worldwide, with a total of 60,065 tree species currently identified. Increasingly, a large amount of data on tree species occurrences is being generated worldwide: from inventories to pressed plants. While many of these data are currently available in big databases, several challenges hamper their use, notably geolocation problems and taxonomic uncertainty. Further, we lack a complete picture of the data coverage and quality assessment for open/public databases of tree occurrences. Methods: We combined data from five major aggregators of occurrence data (e.g. Global Biodiversity Information Facility, Botanical Information and Ecological Network v.3, DRYFLOR, RAINBIO and Atlas of Living Australia) by creating a workflow to integrate, assess and control data quality of tree species occurrences for species distribution modeling. We further assessed the coverage - the extent of geographical data - of five economically important tree families (Arecaceae, Dipterocarpaceae, Fagaceae, Myrtaceae, Pinaceae). Results: Globally, we identified 49,206 tree species (84.69% of total tree species pool) with occurrence records. The total number of occurrence records was 36.69 M, among which 6.40 M could be considered high quality records for species distribution modeling. The results show that Europe, North America and Australia have a considerable spatial coverage of tree occurrence data. Conversely, key biodiverse regions such as South-East Asia and central Africa and parts of the Amazon are still characterized by geographical open-public data gaps. Such gaps are also found even for economically important families of trees, although their overall ranges are covered. Only 15,140 species (26.05%) had at least 20 records of high quality. Conclusions: Our geographical coverage analysis shows that a wealth of easily accessible data exist on tree species occurrences worldwide, but regional gaps and coordinate errors are abundant. Thus, assessment of tree distributions will need accurate occurrence quality control protocols and key collaborations and data aggregation, especially from national forest inventory programs, to improve the current publicly available data.
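The quality-control workflow itself is not reproduced in the abstract. The following sketch (hypothetical field names and deliberately simple rules, not the authors' actual protocol) illustrates the kind of geolocation filtering and per-species record counting implied by the reported figure of at least 20 high-quality records per species.

    def usable(record):
        # Reject records with missing, (0, 0) or out-of-range coordinates.
        lat, lon = record.get("lat"), record.get("lon")
        if lat is None or lon is None or (lat, lon) == (0.0, 0.0):
            return False
        return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

    def modelable_species(records, min_records=20):
        # Keep only species with at least `min_records` usable occurrences.
        counts = {}
        for rec in filter(usable, records):
            counts[rec["species"]] = counts.get(rec["species"], 0) + 1
        return {species for species, n in counts.items() if n >= min_records}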
APA, Harvard, Vancouver, ISO, and other styles
46

Bishop, Brenden. "Examining Random-Coeffcient Pattern-Mixture Models forLongitudinal Data with Informative Dropout." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu150039066582153.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

McCaul, Christopher Francis. "Big Data: Coping with Data Obesity in Cloud Environments." Thesis, Ulster University, 2017. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.724751.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Bernsdorf, Bodo, and Julian Bruns. "Big Data und Data-Mining im Umfeld städtischer Nutzungskartierung." Rhombos-Verlag, 2016. https://slub.qucosa.de/id/qucosa%3A16835.

Full text
Abstract:
Urban land-use mapping can draw on an ever-growing number of data sources. These are, in particular, high-resolution (geo)data from remote sensing platforms such as satellites of the Copernicus programme. So-called Volunteer Geographic Information (VGI) also plays an increasing role. Specially developed application programs, so-called 'apps', can be used to collect such spatial information. Finally, data from social networks also come into play. This contribution deals with the application of Big Data in a geo-temporal context: data with large volumes that enter the process ever faster, originate from a wide variety of sources, carry differing information content and are subject to uncertainty. They may not offer full spatial coverage, come in a wide range of ground resolutions and are incomplete; these are all aspects that contradict the usual criteria for 'good' data. One would like area-wide, high-resolution and highly up-to-date data. The advantage of using Big Data lies not in its 'quality' but in its mass availability. This article should be understood as a progress report that shows first approaches in an application scenario for detecting so-called intra-urban heat islands.
APA, Harvard, Vancouver, ISO, and other styles
49

Franceschini, Davide. "Panoramica sull'utilizzo etico dei Big Data." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/13809/.

Full text
Abstract:
In this thesis I dealt with Big Data: what they are, why they are important, what prospects for use and growth they have, who uses them and for what purposes. In particular, I focused on the laws that regulate their use, and on which rules, already in force or in the process of being adopted, exist at the national and European level. I devoted special attention to some large economic actors that pursue their own interests through the web. For these I described how they gain advantages over their competitors thanks to Big Data and how, at times, they do not make fully ethical use of it, even if (not always) within the limits of the law. In the text I address the problems faced by ordinary people in the digital era: how the protection of their fundamental personal rights, broadly included in the right to privacy, is put to a hard test by constantly developing new technologies, and the dangers that can arise from concentrations of power due to the possession of large amounts of data by a few actors. Finally, I tried to suggest some approaches to the topic that could somehow solve, or at least reduce, the problem of control over one's own data. In summary, I consider this work a wide-ranging view of the possibilities and risks that the use of Big Data entails and will entail in the near future.
APA, Harvard, Vancouver, ISO, and other styles
50

Liu, Yang. "Statistical methods for big tracking data." Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/60916.

Full text
Abstract:
Recent advances in technology have led to large sets of tracking data, which bring new challenges in statistical modeling and prediction. Building on recent developments in Gaussian process modeling for spatio-temporal data and stochastic differential equations (SDEs), we develop a sequence of new models and corresponding inferential methods to meet these challenges. We first propose Bayesian Melding (BM) and downscaling frameworks to combine observations from different sources. To use BM for big tracking data, we exploit the properties of the processes along with approximations to the likelihood to break a high-dimensional problem into a series of lower-dimensional problems. To implement the downscaling approach, we apply the integrated nested Laplace approximation (INLA) to fit a linear mixed effect model that connects the two sources of observations. We apply these two approaches in a case study involving the tracking of marine mammals. Both of our frameworks have superior predictive performance compared with traditional approaches in both cross-validation and simulation studies. We further develop the BM frameworks with stochastic processes that can reflect the time-varying features of the tracks. We first develop a conditional heterogeneous Gaussian Process (CHGP), but certain properties of this process make it extremely difficult to perform model selection. We also propose a linear SDE with splines as its coefficients, which we refer to as a generalized Ornstein-Uhlenbeck (GOU) process. The GOU achieves flexible modeling of the tracks in both mean and covariance with a reasonably parsimonious parameterization. Inference and prediction for this process can be computed via the Kalman filter and smoother. BM with the GOU achieves a smaller prediction error and better credibility intervals in cross-validation comparisons to the basic BM and downscaling models. Following the success with the GOU, we further study a special class of SDEs called the potential field (PF) models, which formulate the drift term as the gradient of another function. We apply the PF approach to modeling of tracks of marine mammals as well as basketball players, and demonstrate its potential in learning, visualizing, and interpreting the trends in the paths.
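As background for the generalized Ornstein-Uhlenbeck (GOU) process mentioned above, here is a sketch (illustrative parameters, not from the thesis) of simulating the basic OU process dX_t = theta*(mu - X_t) dt + sigma dW_t with an Euler-Maruyama step; the GOU of the thesis lets these coefficients vary over time through splines and performs inference with the Kalman filter and smoother rather than by simulation.

    import math, random

    def simulate_ou(x0=5.0, mu=0.0, theta=1.0, sigma=0.5, dt=0.01, n=1000):
        # Euler-Maruyama simulation of dX_t = theta * (mu - X_t) dt + sigma dW_t.
        x, path = x0, [x0]
        for _ in range(n):
            x += theta * (mu - x) * dt + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)
            path.append(x)
        return path

    track = simulate_ou()   # a mean-reverting path drifting from 5.0 toward mu = 0
    print(track[0], track[-1])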
Science, Faculty of
Statistics, Department of
APA, Harvard, Vancouver, ISO, and other styles
