Dissertations / Theses on the topic 'Genomics Big Data Engineering'

Consult the top 50 dissertations / theses for your research on the topic 'Genomics Big Data Engineering.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Goldstein, Theodore C. "Tools for extracting actionable medical knowledge from genomic big data." Thesis, University of California, Santa Cruz, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3589324.

Abstract:

Cancer is an ideal target for personal genomics-based medicine that uses high-throughput genome assays such as DNA sequencing, RNA sequencing, and expression analysis (collectively called omics); however, researchers and physicians are overwhelmed by the quantities of big data from these assays and cannot interpret this information accurately without specialized tools. To address this problem, I have created software methods and tools called OCCAM (OmiC data Cancer Analytic Model) and DIPSC (Differential Pathway Signature Correlation) for automatically extracting knowledge from this data and turning it into an actionable knowledge base called the activitome. An activitome signature measures a mutation's effect on cellular molecular pathways. Activitome signatures can also be computed for clinical phenotypes. By comparing the vectors of activitome signatures of different mutations and clinical outcomes, intrinsic relationships between these events may be uncovered. OCCAM identifies activitome signatures that can be used to guide the development and application of therapies. DIPSC overcomes the confounding problem of correlating multiple activitome signatures from the same set of samples. In addition, to support the collection of this big data, I have developed MedBook, a federated distributed social network designed for a medical research and decision support system. OCCAM and DIPSC are two of the many apps that will operate inside MedBook. MedBook extends the Galaxy system with a signature database, an end-user oriented application platform, a rich data medical knowledge-publishing model, and the Biomedical Evidence Graph (BMEG). The goal of MedBook is to improve outcomes by learning from every patient.
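As an illustration of the kind of comparison DIPSC performs, the sketch below correlates hypothetical pathway-activity ("activitome") signature vectors for a few mutation and phenotype events; the data, event names, and the use of plain Pearson correlation are assumptions for illustration, not the thesis's actual implementation.

```python
# Minimal sketch: comparing pathway-activity ("activitome") signature vectors.
# Hypothetical data; not the OCCAM/DIPSC implementation described in the thesis.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pathways = [f"pathway_{i}" for i in range(50)]

# Rows: events (mutations or clinical phenotypes); columns: inferred pathway activities.
signatures = pd.DataFrame(
    rng.normal(size=(4, 50)),
    index=["TP53_mut", "KRAS_mut", "relapse", "drug_response"],
    columns=pathways,
)

# Pairwise Pearson correlation between signature vectors suggests which
# mutations and clinical outcomes perturb similar pathways.
similarity = signatures.T.corr(method="pearson")
print(similarity.round(2))
```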

2

Miller, Chase Allen. "Towards a Web-Based, Big Data, Genomics Ecosystem." Thesis, Boston College, 2014. http://hdl.handle.net/2345/bc-ir:104052.

Abstract:
Thesis advisor: Gabor T. Marth
Rapid advances in genome sequencing enable a wide range of biological experiments on a scale that was until recently restricted to large genome centers. However, the analysis of the resulting vast genomic datasets is time-consuming, unintuitive and requires considerable computational expertise and costly infrastructure. Collectively, these factors effectively exclude many bench biologists from genome-scale analyses. Web-based visualization and analysis libraries, frameworks, and applications were developed to empower all biological researchers to easily, interactively, and in a visually driven manner, analyze large biomedical datasets that are essential for their research, without bioinformatics expertise or costly hardware.
Thesis (PhD) — Boston College, 2014
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology
3

Hansen, Simon, and Erik Markow. "Big Data : Implementation av Big Data i offentlig verksamhet." Thesis, Högskolan i Halmstad, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-38756.

4

Kämpe, Gabriella. "How Big Data Affects User Experience: Reducing cognitive load in big data applications." Thesis, Umeå universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163995.

Abstract:
We have entered the age of big data. Massive data sets are common in enterprises, government, and academia. Interpreting such scales of data is still hard for the human mind. This thesis investigates how proper design can decrease the cognitive load in data-heavy applications. It focuses on numeric data describing economic growth in retail organizations. It aims to answer the questions: What is important to keep in mind when designing an interface that holds large amounts of data? and How can cognitive load be decreased in complex user interfaces without reducing functionality? It does so by comparing two user interfaces in terms of efficiency, structure, ease of use and navigation. Each interface holds the same functionality and amount of data, but one is designed to increase user experience by reducing cognitive load. The design choices in the second application are based on the theory found in the literature study in the thesis.
5

Luo, Changqing. "Towards Secure Big Data Computing." Case Western Reserve University School of Graduate Studies / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=case1529929603348119.

6

Schobel, Seth Adam Micah. "The viral genomics revolution: Big data approaches to basic viral research, surveillance, and vaccine development." Thesis, University of Maryland, College Park, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10011480.

Abstract:

Since the decoding of the first RNA virus in 1976, the field of viral genomics has exploded, first through the use of Sanger sequencing technologies and later with the use of next-generation sequencing approaches. With the development of these sequencing technologies, viral genomics has entered an era of big data. New challenges for analyzing these data are now apparent. Here, we describe novel methods to extend the current capabilities of viral comparative genomics. Through the use of antigenic distancing techniques, we have examined the relationship between the antigenic phenotype and the genetic content of influenza virus to establish a more systematic approach to viral surveillance and vaccine selection. Distancing of Antigenicity by Sequence-based Hierarchical Clustering (DASH) was developed and used to perform a retrospective analysis of 22 influenza seasons. Our methods produced vaccine candidates identical to, or with a high concordance of antigenic similarity with, those selected by the WHO. In a second effort, we have developed VirComp and OrionPlot: two independent yet related tools. These tools first generate gene-based genome constellations, or genotypes, of viral genomes, and second create visualizations of the resultant genome constellations. VirComp utilizes sequence-clustering techniques to infer genome constellations and prepares genome constellation data matrices for visualization with OrionPlot. OrionPlot is a Java application for tailoring genome constellation figures for publication. OrionPlot allows for color selection of gene cluster assignments, customized box sizes to enable the visualization of gene comparisons based on sequence length, and label coloring. We have provided five analyses designed as vignettes to illustrate the utility of our tools for performing viral comparative genomic analyses. Study three focused on the analysis of respiratory syncytial virus (RSV) genomes circulating during the 2012-2013 RSV season. We discovered a correlation between a recent tandem duplication within the G gene of RSV-A and a decrease in severity of infection. Our data suggest that this duplication is associated with a higher infection rate in female infants than is generally observed. Through these studies, we have extended the state of the art of genotype analysis and phenotype/genotype studies, and established correlations between clinical metadata and RSV sequence data.
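The clustering step behind an approach like DASH can be sketched generically as follows; the strain names, distance values, linkage method, and cut threshold are illustrative assumptions, not the method's actual data or parameters.

```python
# Generic sketch of sequence-based hierarchical clustering, in the spirit of DASH.
# Synthetic pairwise distances stand in for antigenic/genetic distances between strains.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

strains = ["A/H3N2_2010", "A/H3N2_2011", "A/H3N2_2012", "A/H3N2_2013"]
dist = np.array([
    [0.00, 0.02, 0.15, 0.18],
    [0.02, 0.00, 0.14, 0.17],
    [0.15, 0.14, 0.00, 0.03],
    [0.18, 0.17, 0.03, 0.00],
])

# Condensed distance matrix -> average-linkage tree -> flat clusters below a cutoff.
tree = linkage(squareform(dist), method="average")
clusters = fcluster(tree, t=0.1, criterion="distance")
print(dict(zip(strains, clusters)))  # strains grouped into candidate antigenic clusters
```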

7

Cheelangi, Madhusudan. "Result Distribution in Big Data Systems." Thesis, University of California, Irvine, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=1539891.

Abstract:

We are building a Big Data Management System (BDMS) called AsterixDB at UCI. Since AsterixDB is designed to operate on large volumes of data, the results for its queries can be potentially very large, and AsterixDB is also designed to operate under high concurrency workloads. As a result, we need a specialized mechanism to manage these large volumes of query results and deliver them to the clients. In this thesis, we present an architecture and an implementation of a new result distribution framework that is capable of handling large volumes of results under high concurrency workloads. We present the various components of this result distribution framework and show how they interact with each other to manage large volumes of query results and deliver them to clients. We also discuss various result distribution policies that are possible with our framework and compare their performance through experiments.

We have implemented a REST-like HTTP client interface on top of the result distribution framework to allow clients to submit queries and obtain their results. This client interface provides two modes for clients to choose from to read their query results: synchronous mode and asynchronous mode. In synchronous mode, query results are delivered to a client as a direct response to its query within the same request-response cycle. In asynchronous mode, a query handle is returned instead to the client as a response to its query. The client can store the handle and send another request later, including the query handle, to read the result for the query whenever it wants. The architectural support for these two modes is also described in this thesis. We believe that the result distribution framework, combined with this client interface, successfully meets the result management demands of AsterixDB.
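A minimal sketch of the two retrieval modes described above, written against a hypothetical HTTP interface; the endpoint paths, parameters, and response fields are assumptions for illustration and do not reflect AsterixDB's actual API.

```python
# Illustrative client for synchronous vs. asynchronous result retrieval.
# Endpoint paths, parameters, and response fields are hypothetical, not AsterixDB's real API.
import time
import requests

BASE = "http://localhost:19002"  # assumed server address

def query_sync(statement: str) -> str:
    # Synchronous mode: the result comes back in the same request/response cycle.
    resp = requests.post(f"{BASE}/query", data={"statement": statement, "mode": "synchronous"})
    resp.raise_for_status()
    return resp.text

def query_async(statement: str, poll_interval: float = 1.0) -> str:
    # Asynchronous mode: the server returns a handle; the client polls for the result later.
    resp = requests.post(f"{BASE}/query", data={"statement": statement, "mode": "asynchronous"})
    resp.raise_for_status()
    handle = resp.json()["handle"]
    while True:
        result = requests.get(f"{BASE}/result", params={"handle": handle})
        if result.status_code == 200:      # assumed convention: 200 means the result is ready
            return result.text
        time.sleep(poll_interval)          # not ready yet; poll again
```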

8

Laurila, M. (Mikko). "Big data in Finnish financial services." Bachelor's thesis, University of Oulu, 2017. http://urn.fi/URN:NBN:fi:oulu-201711243156.

Abstract:
This thesis aims to explore the concept of big data and create an understanding of big data maturity in the Finnish financial services industry. The research questions of this thesis are "What kind of big data solutions are being implemented in the Finnish financial services sector?" and "Which factors impede faster implementation of big data solutions in the Finnish financial services sector?". Big data, being a concept usually linked with huge data sets and economies of scale, is an interesting topic for research in Finland, a market in which the size of data sets is somewhat limited by the size of the market. This thesis includes a literature review on the concept of big data and earlier literature on the Finnish big data landscape, as well as a qualitative content analysis of publicly available information on big data maturity in the Finnish financial services market. The results of this research show that big data is utilized to some extent in Finland, at least by the larger organizations. Financial-services-specific big data solutions include, for example, the automation of application handling in insurance. The clearest and most specific factors slowing the development of big data maturity in the industry are the lack of a competent workforce and regulatory compliance projects consuming development resources. These results can be used as an overview of the state of big data maturity in the Finnish financial services industry. This study also lays a solid foundation for further research in the form of interviews, which would provide more in-depth data.
9

Flike, Felix, and Markus Gervard. "BIG DATA-ANALYS INOM FOTBOLLSORGANISATIONER En studie om big data-analys och värdeskapande." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20117.

Abstract:
Big data is a relatively new term, but the phenomenon has existed for a long time. It can be described in terms of five Vs: volume, veracity, variety, velocity and value. Big data analysis has proven valuable to organizations in decision-making, in generating measurable economic benefits, and in improving operations. In sports, this began in earnest in the early 2000s with the baseball organization Oakland Athletics, which started recruiting players based on their statistics rather than on how highly scouts rated their ability, with great success. This led more organizations to follow suit, and before long big data analysis was being used in all major sports to gain an advantage over competitors. In a Swedish context, the use of these tools is still relatively new, and many organizations may have moved too quickly in implementing them. Based on a case analysis, this study examines how football organizations work with big data analysis related to their players. The results show that both organizations create value from their investments, which helps them reach their strategic goals, although they do so in different ways. Which approach is most effective in terms of value creation cannot be answered by this study.
10

Nyström, Simon, and Joakim Lönnegren. "Processing data sources with big data frameworks." Thesis, KTH, Data- och elektroteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-188204.

Abstract:
Big data is a concept that is expanding rapidly. As more and more data is generated and garnered, there is an increasing need for efficient solutions that can be utilized to process all this data in attempts to gain value from it. The purpose of this thesis is to find an efficient way to quickly process a large number of relatively small files. More specifically, the purpose is to test two frameworks that can be used for processing big data. The frameworks that are tested against each other are Apache NiFi and Apache Storm. A method is devised in order to, firstly, construct a data flow and, secondly, construct a method for testing the performance and scalability of the frameworks running this data flow. The results reveal that Apache Storm is faster than Apache NiFi at the sort of task that was tested. As the number of nodes included in the tests went up, the performance did not always follow. This indicates that adding more nodes to a big data processing pipeline does not always result in a better-performing setup and that, sometimes, other measures must be taken to improve performance.
11

Adler, Philip David Felix. "Crystalline cheminformatics : big data approaches to crystal engineering." Thesis, University of Southampton, 2015. https://eprints.soton.ac.uk/410940/.

Abstract:
Statistical approaches to chemistry, under the umbrella of cheminformatics, are now widespread, in particular as a part of quantitative structure-activity relationship and quantitative structure-property relationship studies of candidate pharmaceuticals. Using such approaches on legacy data has widely been termed "taking a big data approach", and finds ready application in cohort medicinal studies and psychological studies. Crystallography is a field ripe for these approaches, owing in no small part to its history as a field which, by necessity, adopted digital technologies relatively early on as a part of X-ray crystallographic techniques. A discussion of the historical background of crystallography, crystallographic engineering and the pertinent areas of cheminformatics, including programming, databases, file formats, and statistics, is given as background to the presented research. Presented here are a series of applications of Big Data techniques within the field of crystallography. Firstly, a naïve attempt at descriptor selection was made using a family of sulphonamide crystal structures and glycine crystal structures. This proved to be unsuccessful owing to the very large number of available descriptors and the very small number of true glycine polymorphs used in the experiment. Secondly, an attempt to combine machine learning model building with feature selection was made using co-crystal structures obtained from the Cambridge Structural Database, using partition modelling. This method established sensible sets of descriptors which would act as strong predictors for the formation of co-crystals; however, validation of the models by using them to make predictions demonstrated their poor predictive power and led to the uncovering of a number of weaknesses therein. Thirdly, a homologous series of fluorobenzeneanilides was used as a test bed for a novel, invariant topological descriptor. The descriptor itself is based on graph-theoretical techniques, and is derived from the patterns of close contacts within the crystal structure. Fluorobenzeneanilides present an interesting case in this context because of the historical understanding that fluorine is rarely known to be a component in a hydrogen bonding system. Regardless, the descriptor correlates with the melting point of the fluorobenzeneanilides, with one exception; the reasons for this exception are explored. In addition, a comparison was undertaken between categorisations of the crystal structures using more traditional "by-eye" techniques and groupings of compounds by shared values of the invariant descriptor. It is demonstrated that the novel descriptor does not simply act as a proxy for the arrangement of the molecules in the crystal lattice: intuitively similar structures have different values for the descriptor, while very different structures can have similar values. This is evidence against the general trend of exploring intermolecular contacts in isolation from other influences on lattice formation. The correlation of the descriptor with melting point in this context suggests that the properties of crystalline materials are not only products of their lattice structure. Also presented as part of all of the case studies is an illustration of some weaknesses of the methodology, and a discussion of how these difficulties can be overcome, both by individual scientists and by necessary alterations to the collective approach to recording crystallographic experiments.
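A toy sketch of the general idea of a contact-derived, labelling-invariant descriptor follows; the contact list and the choice of invariant (a sorted degree sequence) are simplified assumptions, not the descriptor defined in the thesis.

```python
# Toy illustration of deriving a graph-theoretic invariant from close contacts.
# Hypothetical contacts and a deliberately simple invariant, not the thesis's descriptor.
import networkx as nx

# Hypothetical close contacts between molecules in a crystal packing (edges).
contacts = [("mol1", "mol2"), ("mol1", "mol3"), ("mol2", "mol3"), ("mol3", "mol4")]

g = nx.Graph(contacts)

# A simple invariant: the sorted degree sequence of the contact graph.
# It is unchanged under relabelling of the molecules, i.e. it depends only
# on the pattern of contacts, not on how the structure happens to be described.
descriptor = tuple(sorted(d for _, d in g.degree()))
print(descriptor)  # (1, 2, 2, 3) for the contacts above
```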
12

Ohlsson, Anna, and Dan Öman. "A guide in the Big Data jungle." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-1057.

Abstract:
This bachelor thesis looks at the functionality of different frameworks for data analysis at large scale, and its purpose is to serve as a guide among available tools. The amount of data that is generated every day keeps growing, and for companies to take advantage of the data they collect they need to know how to analyze it to gain maximal use out of it. The choice of platform for this analysis plays an important role, and you need to look into the functionality of the different alternatives that are available. We have created a guide to make this research easier and less time consuming. To evaluate our work we created a summary and a survey which we asked a number of IT students, current and previous, to take part in. After analyzing their answers we could see that most of them found our thesis interesting and informative.
13

Al-Shiakhli, Sarah. "Big Data Analytics: A Literature Review Perspective." Thesis, Luleå tekniska universitet, Institutionen för system- och rymdteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-74173.

Abstract:
Big data is currently a buzzword in both academia and industry, with the term being used to describe a broad domain of concepts, ranging from extracting data from outside sources, storing and managing it, to processing such data with analytical techniques and tools. This thesis work thus aims to provide a review of current big data analytics concepts in an attempt to highlight big data analytics' importance to decision making. Due to the rapid increase in interest in big data and its importance to academia, industry, and society, solutions to handling data and extracting knowledge from datasets need to be developed and provided with some urgency to allow decision makers to gain valuable insights from the varied and rapidly changing data they now have access to. Many companies are using big data analytics to analyse the massive quantities of data they have, with the results influencing their decision making. Many studies have shown the benefits of using big data in various sectors, and in this thesis work, various big data analytical techniques and tools are discussed to allow analysis of the application of big data analytics in several different domains.
14

Hellström, Hampus, and Oscar Ohm. "Big Data - Stort intresse, nya möjligheter." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20307.

Abstract:
Today's information society consists of people, businesses and machines that together generate large amounts of data every day. This exponential growth of data generation has led to the creation of what we call Big Data. Among other things, the data produced, gathered and stored can be used by companies to practise knowledge-based business development. Traditionally, the methods used for generating knowledge about a business environment and market have been time-consuming and expensive, and often conducted by a specialized research company that carries out market research and surveys. Today the analysis of existing data sets is becoming increasingly valuable, and the research companies have a great opportunity to mine value from society's huge amounts of data. The study is designed as an exploratory case study that investigates how research companies in Sweden work with these data sets, and identifies some of the challenges they face in the application of Big Data analysis in their business. The results show that the participating research companies are using Big Data tools to streamline existing business processes and, to some extent, as a complement to traditional research and surveys. Although they see possibilities with the technology, the participating companies are unwilling to drive the development of new business processes supported by Big Data analysis, and there is a challenge in the lack of competence prevailing in the Swedish market. The results also cover some of the ethical aspects research companies need to take into consideration; these are especially problematic when data that can be linked to an individual is processed and analysed in real time.
15

Huttanus, Herbert M. "Screening and Engineering Phenotypes using Big Data Systems Biology." Diss., Virginia Tech, 2019. http://hdl.handle.net/10919/102706.

Abstract:
Biological systems display remarkable complexity that is not properly accounted for in small, reductionistic models. Increasingly, big data approaches using genomics, proteomics, metabolomics etc. are being applied to predicting and modifying the emergent phenotypes produced by complex biological systems. In this research, several novel tools were developed to assist in the acquisition and analysis of biological big data for a variety of applications. In total, two entirely new tools were created and a third, relatively new method was evaluated by applying it to questions of clinical importance. 1) To assist in the quantification of metabolites at the subcellular level, a strategy for localized in-vivo enzymatic assays was proposed. A proof of concept for this strategy was conducted in which the local availability of acetyl-CoA in the peroxisomes of yeast was quantified by the production of polyhydroxybutyrate (PHB) using three heterologous enzymes. The resulting assay demonstrated the differences in acetyl-CoA availability in the peroxisomes under various culture conditions and genetic alterations. 2) To assist in the design of genetically modified microbe strains that are stable over many generations, software was developed to automate the selection of gene knockouts that would result in coupling cellular growth with production of a desired chemical. This software, called OptQuick, provides advantages over contemporary software for the same purpose. OptQuick can run considerably faster and uses a free optimization solver, GLPK. Knockout strategies generated by OptQuick were compared to case studies of similar strategies produced by contemporary programs. In these comparisons, OptQuick found many of the same gene targets for knockout. 3) To provide an inexpensive and non-invasive alternative for bladder cancer screening, Raman-based urinalysis was performed on clinical urine samples using Rametrix™ software. Rametrix™ has been previously developed and applied to other urinalysis applications, but this study was the first instance of applying this new technology to bladder cancer screening. Using a pool of 17 bladder cancer positive urine samples and 39 clinical samples exhibiting a range of healthy or other genitourinary disease phenotypes, Rametrix™ was able to detect bladder cancer with a sensitivity of 94% and a specificity of 54%. 4) Methods for urine sample preservation were tested with regard to their effect on subsequent analysis with Rametrix™. Specifically, sterile filtration was tested as a potential method for extending the duration for which samples may be kept at room temperature prior to Raman analysis. Sterile filtration was shown to alter the chemical profile initially, but did not prevent further shifts in chemical profile over time. In spite of this, unfiltered and filtered urine samples alike could be used for screening for chronic kidney disease or bladder cancer even after being stored for two weeks at room temperature, making sterile filtration largely unnecessary.
Doctor of Philosophy
16

Smith, Derik Lafayette, and Satya Prakash Dhavala. "Using big data for decisions in agricultural supply chain." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/81106.

Abstract:
Thesis (M. Eng. in Logistics)--Massachusetts Institute of Technology, Engineering Systems Division, 2013.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 53-54).
Agriculture is an industry where historical and current data abound. This paper investigates the numerous data sources available in the agricultural field and analyzes them for usage in supply chain improvement. We identified certain applicable data and investigated methods of using this data to make better supply chain decisions within the agricultural chemical distribution chain. We identified a specific product, AgChem, for this study. AgChem, like many agricultural chemicals, is forecasted and produced months in advance of a very short sales window. With improved demand forecasting based on abundantly-available data, Dow AgroSciences, the manufacturer of AgChem, can make better production and distribution decisions. We analyzed various data to identify factors that influence AgChem sales. Many of these factors relate to corn production since AgChem is generally used with corn crops. Using regression models, we identified leading indicators that help forecast future demand for the product. We developed three regression models to forecast demand on various horizons. The first model identified that the price of corn and the price of fertilizer affect the annual, nation-wide demand for the product. The second model explains the expected geographic distribution of this annual demand. It shows that the number of retailers in an area is correlated with the total annual demand in that area. The model also quantifies the relationship between the sales in the first few weeks of the season and the total sales for the season. The third model serves as a short-term demand-sensing tool to predict the timing of the demand within certain geographies. We found that weather conditions and the timing of harvest affect when AgChem sales occur. With these models, Dow AgroSciences has a better understanding of how external factors influence the sale of AgChem. With this new understanding, they can make better decisions about the distribution of the product and position inventory in a timely manner at the source of demand.
by Derik Lafayette Smith and Satya Prakash Dhavala.
M. Eng. in Logistics
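The flavour of the leading-indicator regressions described in this abstract can be sketched as below; the predictor variables follow the abstract, but the numbers are synthetic and the model is a generic least-squares fit, not Dow AgroSciences' actual models.

```python
# Sketch of a leading-indicator regression for annual demand.
# Synthetic observations; not the thesis's data or fitted coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical annual observations: [corn price, fertilizer price] -> national demand index.
X = np.array([
    [4.1, 520], [5.3, 610], [6.2, 700], [4.8, 560],
    [5.9, 660], [6.8, 740], [4.5, 540], [5.1, 590],
])
y = np.array([98, 112, 127, 104, 120, 133, 101, 109])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

# Forecast next season's nation-wide demand from expected prices.
print(model.predict([[5.5, 630]]))
```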
17

Lu, Feng. "Big data scalability for high throughput processing and analysis of vehicle engineering data." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-207084.

Abstract:
"Sympathy for Data" is a platform that is utilized for Big Data automation analytics. It is based on visual interface and workflow configurations. The main purpose of the platform is to reuse parts of code for structured analysis of vehicle engineering data. However, there are some performance issues on a single machine for processing a large amount of data in Sympathy for Data. There are also disk and CPU IO intensive issues when the data is oversized and the platform need fits comfortably in memory. In addition, for data over the TB or PB level, the Sympathy for data needs separate functionality for efficient processing simultaneously and scalable for distributed computation functionality. This paper focuses on exploring the possibilities and limitations in using the Sympathy for Data platform in various data analytic scenarios within the Volvo Cars vision and strategy. This project re-writes the CDE workflow for over 300 nodes into pure Python script code and make it executable on the Apache Spark and Dask infrastructure. We explore and compare both distributed computing frameworks implemented on Amazon Web Service EC2 used for 4 machine with a 4x type for distributed cluster measurement. However, the benchmark results show that Spark is superior to Dask from performance perspective. Apache Spark and Dask will combine with Sympathy for Data products for a Big Data processing engine to optimize the system disk and CPU IO utilization. There are several challenges when using Spark and Dask to analyze large-scale scientific data on systems. For instance, parallel file systems are shared among all computing machines, in contrast to shared-nothing architectures. Moreover, accessing data stored in commonly used scientific data formats, such as HDF5 is not tentatively supported in Spark. This report presents research carried out on the next generation of Big Data platforms in the automotive industry called "Sympathy for Data". The research questions focusing on improving the I/O performance and scalable distributed function to promote Big Data analytics. During this project, we used the Dask.Array parallelism features for interpretation the data sources as a raster shows in table format, and Apache Spark used as data processing engine for parallelism to load data sources to memory for improving the big data computation capacity. The experiments chapter will demonstrate 640GB of engineering data benchmark for single node and distributed computation mode to evaluate the Sympathy for Data Disk CPU and memory metrics. Finally, the outcome of this project improved the six times performance of the original Sympathy for data by developing a middleware SparkImporter. It is used in Sympathy for Data for distributed computation and connected to the Apache Spark for data processing through the maximum utilization of the system resources. This improves its throughput, scalability, and performance. It also increases the capacity of the Sympathy for data to process Big Data and avoids big data cluster infrastructures.
18

Stjerna, Albin. "Medium Data on Big Data: Predicting Disk Failures in CERN's NetApp-based Data Storage System." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-337638.

Abstract:
I describe in this report an experimental system for using classification and regression trees to generate predictions of disk failures in a NetApp-based storage system at the European Organisation for Nuclear Research (CERN), based on a mixture of SMART data, system logs, and low-level system performance data particular to NetApp's storage solutions. Additionally, I make an attempt at profiling the system's built-in failure prediction method, and at compiling statistics on historical complete-disk failures as well as bad blocks developed. Finally, I experiment with various parameters for producing classification trees and end up with two candidate models, which have a true-positive rate of 86% with a false-alarm rate of 4%, or a true-positive rate of 71% and a false-alarm rate of 0.9%, respectively, illustrating that classification trees might be a viable method for predicting real-life disk failures in CERN's storage systems.
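The modelling step can be sketched with an off-the-shelf classification tree as below; the features, synthetic labels, and tree parameters are assumptions for illustration, not the thesis's actual feature set or the models that achieved the quoted rates.

```python
# Sketch: fit a classification tree to disk-health features and read off the
# true-positive / false-alarm trade-off. Synthetic data, not CERN/NetApp telemetry.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)
n = 5000
# Hypothetical features: reallocated sectors, read-error rate, temperature, bad blocks.
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n) > 2.0).astype(int)  # rare failures

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced").fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("true-positive rate:", tp / (tp + fn))
print("false-alarm rate:  ", fp / (fp + tn))
```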
19

Bao, Shunxing. "Algorithmic Enhancements to Data Colocation Grid Frameworks for Big Data Medical Image Processing." Thesis, Vanderbilt University, 2019. http://pqdtopen.proquest.com/#viewpdf?dispub=13877282.

Abstract:

Large-scale medical imaging studies to date have predominantly leveraged in-house, laboratory-based or traditional grid computing resources for their computing needs, where the applications often use hierarchical data structures (e.g., Network file system file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The results for laboratory-based approaches reveal that performance is impeded by standard network switches, since typical processing can saturate network bandwidth during transfer from storage to processing nodes even for moderate-sized studies. On the other hand, the grid may be costly to use due to the dedicated resources needed to execute the tasks and its lack of elasticity. With the increasing availability of cloud-based big data frameworks, such as Apache Hadoop, cloud-based services for executing medical imaging studies have shown promise.

Despite this promise, our studies have revealed that existing big data frameworks exhibit different performance limitations for medical imaging applications, which calls for new algorithms that optimize their performance and suitability for medical imaging. For instance, Apache HBase's data distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). Big data medical image processing applications involving multi-stage analysis often exhibit significant variability in processing times, ranging from a few seconds to several days. Due to the sequential nature of executing the analysis stages by traditional software technologies and platforms, any errors in the pipeline are only detected at the later stages, despite the sources of errors predominantly being the highly compute-intensive first stage. This wastes precious computing resources and incurs prohibitively higher costs for re-executing the application. To address these challenges, this research proposes a framework, Hadoop & HBase for Medical Image Processing (HadoopBase-MIP), which develops a range of performance optimization algorithms and employs system behavior modeling for data storage, data access and data processing. We also describe how prototypes were built to verify the modeled system behaviors empirically. Furthermore, we report a discovery made during the development of HadoopBase-MIP: a new type of contrast for enhancing deep brain structures in medical imaging. Finally, we show how to carry the Hadoop-based framework design forward into a commercial big data / high-performance computing cluster with a cheap, scalable and geographically distributed file system.
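One way to make the hierarchical-locality concern above concrete is in the design of the row key itself; the sketch below is an illustrative composite-key scheme, not the exact key format used by HadoopBase-MIP.

```python
# Sketch of a row-key scheme that keeps hierarchically related imaging data adjacent.
# Field names and separator are illustrative assumptions, not HadoopBase-MIP's key format.
def imaging_row_key(project: str, subject: str, session: str, scan: str, slice_no: int) -> bytes:
    # Ordering the components from coarse to fine means that an HBase scan over a
    # key prefix (e.g. one subject) touches a contiguous key range instead of
    # scattering reads across regions.
    return "|".join([project, subject, session, scan, f"{slice_no:05d}"]).encode("utf-8")

key = imaging_row_key("projA", "subj012", "sess01", "T1w", 42)
print(key)  # b'projA|subj012|sess01|T1w|00042'
```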

20

Jiang, Yiming. "Automated Generation of CAD Big Data for Geometric Machine Learning." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1576329384392725.

21

Moran, Andrew M. Eng Massachusetts Institute of Technology. "Improving big data visual analytics with interactive virtual reality." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/105972.

Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 80-84).
For decades, the growth and volume of digital data collection has made it challenging to digest large volumes of information and extract underlying structure. Coined 'Big Data', massive amounts of information have quite often been gathered inconsistently (e.g. from many sources, of various forms, at different rates, etc.). These factors impede the practices of not only processing data, but also analyzing and displaying it in an efficient manner to the user. Many efforts have been made in the data mining and visual analytics community to create effective ways to further improve analysis and achieve the knowledge desired for better understanding. Our approach for improved big data visual analytics is two-fold, focusing on both visualization and interaction. Given geo-tagged information, we are exploring the benefits of visualizing datasets in the original geospatial domain by utilizing a virtual reality platform. After running proven analytics on the data, we intend to represent the information in a more realistic 3D setting, where analysts can achieve an enhanced situational awareness and rely on familiar perceptions to draw in-depth conclusions on the dataset. In addition, developing a human-computer interface that responds to natural user actions and inputs creates a more intuitive environment. Tasks can be performed to manipulate the dataset and allow users to dive deeper upon request, adhering to desired demands and intentions. Due to the volume and popularity of social media, we developed a 3D tool visualizing Twitter on MIT's campus for analysis. Utilizing emerging technologies of today to create a fully immersive tool that promotes visualization and interaction can help ease the process of understanding and representing big data.
by Andrew Moran.
M. Eng.
22

Jun, Sang-Woo. "Scalable multi-access flash store for Big Data analytics." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/87947.

Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2014.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 47-49).
For many "Big Data" applications, the limiting factor in performance is often the transportation of large amount of data from hard disks to where it can be processed, i.e. DRAM. In this work we examine an architecture for a scalable distributed flash store which aims to overcome this limitation in two ways. First, the architecture provides a high-performance, high-capacity, scalable random-access storage. It achieves high-throughput by sharing large numbers of flash chips across a low-latency, chip-to-chip backplane network managed by the flash controllers. The additional latency for remote data access via this network is negligible as compared to flash access time. Second, it permits some computation near the data via a FPGA-based programmable flash controller. The controller is located in the datapath between the storage and the host, and provides hardware acceleration for applications without any additional latency. We have constructed a small-scale prototype whose network bandwidth scales directly with the number of nodes, and where average latency for user software to access flash store is less than 70[mu]s, including 3.5[mu]s of network overhead.
by Sang-Woo Jun.
S.M.
23

Hansson, Karakoca Josef. "Big Data Types : Internally Parallel in an Actor Language." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-372248.

Abstract:
Around the year 2005 the hardware industry hit a power wall. It was no longer possible to drastically increase computer performance by decreasing the transistors' size or increasing the clock speed of the CPU. To ensure future development, multi-core processors became the way to go. The Programming Languages Group at Uppsala University is developing a programming language called Encore, designed to be scalable to future machines with a few hundred or even a thousand processor cores. This thesis reports on the design and implementation of Big data types. Big data types are locally distributed data structures that allow internal parallelism in the actor model by using several actors in their implementations. Thus, rather than serializing all interaction, these data structures are potentially as parallel as the number of actors used to construct them. The goal of Big data types is to provide a tool that makes it easier for an Encore programmer to create parallel and concurrent programs. As part of our evaluation, we have implemented a MapReduce framework which showcases how Big data types can be used in a more complex program.
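Since Encore code is not reproduced here, the sketch below uses Python's multiprocessing as a stand-in to illustrate the underlying idea: a collection partitioned over several workers (actors, in Encore) so that a map-reduce over it runs internally in parallel. The word-count task and partitioning scheme are assumptions for illustration, not the thesis's MapReduce framework.

```python
# Conceptual stand-in for an internally parallel "big data type": the data is split
# over several workers and a map-reduce runs on the partitions concurrently.
from multiprocessing import Pool

def word_count(chunk):
    # "Map" step executed independently on each partition.
    counts = {}
    for word in chunk:
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(a, b):
    # "Reduce" step combining per-partition results.
    for k, v in b.items():
        a[k] = a.get(k, 0) + v
    return a

if __name__ == "__main__":
    data = ["actor", "data", "actor", "parallel", "data", "actor"] * 1000
    partitions = [data[i::4] for i in range(4)]     # the structure is split over 4 workers
    with Pool(4) as pool:
        partial = pool.map(word_count, partitions)  # partitions processed in parallel
    total = {}
    for p in partial:
        total = merge(total, p)
    print(total)
```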
24

Lindberg, Johan. "Big Data och Hadoop : Nästa generation av lagring." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-31079.

Abstract:
The goal of this report and study is to determine, at a theoretical level, the possibilities for Försäkringskassan IT to change platform for storage of the data used in their daily activities. Försäkringskassan collects immense amounts of data every day, containing personal information, lines of programming code, payments and customer service tickets. Today, everything is stored in large relational databases, which leads to problems with scalability and performance. The new platform studied in this report is built on a storage technology named Hadoop. Hadoop is developed to store and process data distributed across clusters of commodity server hardware. The platform promises near-linear scalability, the possibility to store all data with high fault tolerance, and the ability to handle massive amounts of data. The study is done through theoretical studies as well as a proof of concept. The theory studies focus on the background of Hadoop, its structure, and what to expect in the future. The platform used at Försäkringskassan today is specified and compared to the new platform. A proof of concept is conducted in a test environment at Försäkringskassan running a Hadoop platform from Hortonworks. Its purpose is to show how data storage is done and that unstructured data can be stored. The study shows that no theoretical problems have been found and that a move to the new platform should be possible. It does, however, move handling of the data from before storage to after. This is because today's platform relies on relational databases that require data to be structured neatly before it can be stored, whereas Hadoop stores all data as-is but requires more work and knowledge to retrieve it.
25

Toole, Jameson Lawrence. "Putting big data in its place : understanding cities and human mobility with new data sources." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/98631.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Engineering Systems Division, June 2015.
Cataloged from PDF version of thesis. "February 2015."
Includes bibliographical references (pages 223-241).
According to the United Nations Population Fund (UNFPA), 2008 marked the first year in which the majority of the planet's population lived in cities. Urbanization, already over 80% in many western regions, is increasing rapidly as migration into cities continues. The density of cities provides residents access to places, people, and goods, but also gives rise to problems related to health, congestion, and safety. In parallel with rapid urbanization, ubiquitous mobile computing, namely the pervasive use of cellular phones, has generated a wealth of data that can be analyzed to understand and improve urban systems. These devices and the applications that run on them passively record social, mobility, and a variety of other behaviors of their users with extremely high spatial and temporal resolution. This thesis presents a variety of novel methods and analyses to leverage the data generated from these devices to understand human behavior within cities. It details new ways to measure and quantify human behaviors related to mobility, social influence, and economic outcomes.
by Jameson Lawrence Toole.
Ph. D.
26

Bhagattjee, Benoy. "Emergence and taxonomy of big data as a service." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/90709.

Abstract:
Thesis: S.M. in Engineering and Management, Massachusetts Institute of Technology, Engineering Systems Division, System Design and Management Program, 2014.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 82-83).
The amount of data that we produce and consume is growing exponentially in the modern world. Increasing use of social media and new innovations such as smartphones generate large amounts of data that can yield invaluable information if properly managed. These large datasets, popularly known as Big Data, are difficult to manage using traditional computing technologies. New technologies are emerging in the market to address the problem of managing and analyzing Big Data to produce invaluable insights from it. Organizations are finding it difficult to implement these Big Data technologies effectively due to problems such as lack of available expertise. Some of the latest innovations in the industry are related to cloud computing and Big Data. There is significant interest in academia and industry in combining Big Data and cloud computing to create new technologies that can solve the Big Data problem. Big Data based on cloud computing is an upcoming area in computer science, and many vendors are providing their ideas on this topic. The combination of Big Data technologies and cloud computing platforms has led to the emergence of a new category of technology called Big Data as a Service, or BDaaS. This thesis aims to define the BDaaS service stack and to evaluate a few technologies in the cloud computing ecosystem using the BDaaS service stack. The BDaaS service stack provides an effective way to classify Big Data technologies, enabling technology users to evaluate and choose the technology that meets their requirements effectively. Technology vendors can use the same BDaaS stack to communicate their product offerings better to the consumer.
by Benoy Bhagattjee.
S.M. in Engineering and Management
27

Jun, Sang-Woo. "Big data analytics made affordable using hardware-accelerated flash storage." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/118088.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 175-192).
Vast amounts of data are continuously being collected from sources including social networks, web pages, and sensor networks, and their economic value is dependent on our ability to analyze them in a timely and affordable manner. High-performance analytics have traditionally required a machine or a cluster of machines with enough DRAM to accommodate the entire working set, due to their need for random accesses. However, datasets of interest are now regularly exceeding terabytes in size, and the cost of purchasing and operating a cluster with hundreds of machines is becoming a significant overhead. Furthermore, the performance of many random-access-intensive applications plummets even when a fraction of the data does not fit in memory. On the other hand, such datasets could be stored easily in the flash-based secondary storage of a rack-scale cluster, or even a single machine, for a fraction of the capital and operating costs. While flash storage has much better performance compared to hard disks, there are many hurdles to overcome in order to reach the performance of DRAM-based clusters. This thesis presents a new system architecture as well as operational methods that enable flash-based systems to achieve performance comparable to much costlier DRAM-based clusters for many important applications. We describe a highly customizable architecture called BlueDBM, which includes flash storage devices augmented with in-storage hardware accelerators, networked using a separate storage-area network. Using a prototype BlueDBM cluster with custom-designed accelerated storage devices, as well as novel accelerator designs and storage management algorithms, we have demonstrated high performance at low cost for applications including graph analytics, sorting, and database operations. We believe this approach to handling Big Data analytics is an attractive solution to the cost-performance issue of Big Data analytics.
by Sang-Woo Jun.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
28

Battle, Leilani Marie. "Interactive visualization of big data leveraging databases for scalable computation." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/84906.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 55-57).
Modern database management systems (DBMS) have been designed to efficiently store, manage and perform computations on massive amounts of data. In contrast, many existing visualization systems do not scale seamlessly from small data sets to enormous ones. We have designed a three-tiered visualization system called ScalaR to deal with this issue. ScalaR dynamically performs resolution reduction when the expected result of a DBMS query is too large to be effectively rendered on existing screen real estate. Instead of running the original query, ScalaR inserts aggregation, sampling or filtering operations to reduce the size of the result. This thesis presents the design and implementation of ScalaR, and shows results for two example applications, visualizing earthquake records and satellite imagery data, stored in SciDB as the back-end DBMS.
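To make the resolution-reduction idea concrete, here is a minimal sketch of how a middleware layer might rewrite an over-large query into an aggregated one before sending it to the DBMS. The table and column names, the screen budget, and the PostgreSQL-style width_bucket binning are illustrative assumptions only; they are not ScalaR's actual implementation, which targets SciDB.

```python
def reduce_resolution(table, x_col, y_col, estimated_rows, max_points=10_000):
    """Return a query whose result fits the rendering budget.

    If the estimated result is small enough, fetch the raw rows; otherwise
    bin the x-axis and aggregate y so at most max_points rows come back.
    """
    if estimated_rows <= max_points:
        return f"SELECT {x_col}, {y_col} FROM {table}"
    return (
        f"SELECT width_bucket({x_col}, b.min_x, b.max_x, {max_points}) AS bucket, "
        f"avg({y_col}) AS {y_col} "
        f"FROM {table}, "
        f"(SELECT min({x_col}) AS min_x, max({x_col}) AS max_x FROM {table}) AS b "
        f"GROUP BY bucket ORDER BY bucket"
    )
```

Sampling or filtering clauses could be substituted for the aggregation step, mirroring the three reduction operations the abstract lists.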
by Leilani Marie Battle.
S.M.
APA, Harvard, Vancouver, ISO, and other styles
29

Wu, Sherwin Zhang. "Sifter : a generalized, efficient, and scalable big data corpus generator." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/100684.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
Cataloged from PDF version of thesis.
Includes bibliographical references (page 61).
Big data has reached the point where the volume, velocity, and variety of data place significant limitations on the computer systems that process and analyze it. Working with very large data sets has become increasingly unwieldy. Our goal was therefore to create a system that supports efficient extraction of data subsets down to a size that can be manipulated on a single machine. Sifter was developed as a big data corpus generator that lets scientists generate these smaller datasets from an original larger one. Sifter's three-layer architecture allows client users to easily create their own custom data corpus jobs, while allowing administrative users to easily integrate additional core data sets into Sifter. This thesis presents the implemented Sifter system deployed on an initial Twitter dataset. We further show how we added support for a secondary MIMIC medical dataset and demonstrate the scalability of Sifter on very large datasets.
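As a rough illustration of the kind of subset extraction such corpus jobs perform, the sketch below streams a large newline-delimited JSON dataset once, keeps records matching a predicate, and reservoir-samples them down to a target size. The file format, the predicate, and the sample size are assumptions for illustration; Sifter's actual job pipeline is not reproduced here.

```python
import json
import random

def extract_corpus(path, predicate, sample_size, seed=0):
    """Single pass over a newline-delimited JSON file:
    filter with `predicate`, then reservoir-sample to `sample_size` records."""
    rng = random.Random(seed)
    sample, seen = [], 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if not predicate(record):
                continue
            seen += 1
            if len(sample) < sample_size:
                sample.append(record)
            else:
                j = rng.randrange(seen)      # classic reservoir sampling step
                if j < sample_size:
                    sample[j] = record
    return sample

# e.g. tweets mentioning a keyword (file name and field names are hypothetical):
# corpus = extract_corpus("tweets.jsonl", lambda r: "genomics" in r.get("text", ""), 50_000)
```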
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
30

Eigner, Martin. "Das Industrial Internet – Engineering Prozesse und IT-Lösungen." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-214588.

Full text
Abstract:
Engineering is currently undergoing a massive transformation: smart systems and technologies, cybertronic products, Big Data and cloud computing in the context of the Internet of Things and Services, and Industrie 4.0. The American concept of the "Industrial Internet", however, describes this (r)evolution far better than the narrower, strongly German-influenced term Industrie 4.0. The Industrial Internet takes the entire product lifecycle into account and addresses consumer and capital goods as well as services. This contribution examines this forward-looking trend and offers well-founded insights into the networked engineering world of tomorrow, its design methods and processes, and the corresponding IT solutions.
APA, Harvard, Vancouver, ISO, and other styles
31

Backurs, Arturs. "Below P vs NP : fine-grained hardness for big data problems." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/120376.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 145-156).
The theory of NP-hardness has been remarkably successful in identifying problems that are unlikely to be solvable in polynomial time. However, many other important problems do have polynomial-time algorithms, yet large exponents in their runtime bounds can make them inefficient in practice. For example, quadratic-time algorithms, although practical on moderately sized inputs, can become inefficient on big data problems that involve gigabytes or more of data. Although no sub-quadratic-time algorithms are known for many data analysis problems, evidence of quadratic-time hardness has remained elusive. In this thesis we present hardness results for several text analysis and machine learning tasks: (1) lower bounds for edit distance, regular expression matching, and other pattern-matching and string-processing problems; and (2) lower bounds for empirical risk minimization, such as kernel support vector machines and other kernel machine learning problems. All of these problems have polynomial-time algorithms, but despite an extensive amount of research, no near-linear-time algorithms have been found. We show that, under a natural complexity-theoretic conjecture, such algorithms do not exist. We also show how these lower bounds have inspired the development of efficient algorithms for some variants of these problems.
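As a concrete example of this style of result, the display below states, informally and in generic notation rather than in the thesis's exact wording, the well-known SETH-based quadratic-time barrier for edit distance on strings of length n (m denotes the number of clauses of the CNF-SAT instance).

```latex
% A representative fine-grained, conditional lower bound (informal):
% a truly sub-quadratic edit-distance algorithm would refute SETH.
\[
  \mathrm{EditDistance}(n) \in O\!\bigl(n^{2-\delta}\bigr) \text{ for some } \delta > 0
  \;\Longrightarrow\;
  \exists\, \varepsilon > 0 \;\forall k:\;
  k\text{-SAT} \in O\!\bigl(2^{(1-\varepsilon) n} \cdot \mathrm{poly}(m)\bigr),
\]
% i.e. the Strong Exponential Time Hypothesis (SETH) would be false.
```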
by Arturs Backurs.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
32

Bunpuckdee, Bhadin, and Ömer Tekbas. "Ideation with Big Data : A case study of a large mature firm." Thesis, KTH, Maskinkonstruktion (Inst.), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-277732.

Full text
Abstract:
Big Data has gained much attention and interest from organizations in recent years. The rise of new technologies has made it simpler to process and store data, prompting organizations to ask what value Big Data can bring to them. However, collecting Big Data does not automatically generate business opportunities; organizations need to understand how to process Big Data and how to implement the insights. To enable this, new competences are needed, and firms need to adapt to more co-innovative constellations. The purpose of this study is to investigate what innovation processes a cross-functional team of data experts uses to ideate possible business opportunities. Furthermore, the aim is to propose recommendations on how an organization can become more efficient when ideating. The case study was carried out at a large, established auditing company, specifically in a support department with expertise in data analytics, automation and artificial intelligence. The data was collected through internal interviews within the department (Department A). The case study resulted in recommendations on what to consider when ideating with Big Data. A key aspect is that Big Data enables co-innovation to prosper, so bringing together customers, domain experts and Big Data experts is crucial for successful ideation. Moreover, an understanding of different innovation aspects will help organizations ideate with Big Data more efficiently.
APA, Harvard, Vancouver, ISO, and other styles
33

Landelius, Cecilia. "Data governance in big data : How to improve data quality in a decentralized organization." Thesis, KTH, Industriell ekonomi och organisation (Inst.), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301258.

Full text
Abstract:
The use of the internet has increased the amount of data available and gathered. Companies are investing in big data analytics to gain insights from this data. However, the value of the analysis, and of decisions made based on it, depends on the quality of the underlying data. For this reason, data quality has become a prevalent issue for organizations. Additionally, failures in data quality management are often due to organizational aspects. Due to the growing popularity of decentralized organizational structures, there is a need to understand how a decentralized organization can improve data quality. This thesis conducts a qualitative single case study of an organization in the logistics industry that is currently shifting towards becoming data driven and struggling to maintain data quality. The purpose of the thesis is to answer the questions: • RQ1: What is data quality in the context of logistics data? • RQ2: What are the obstacles to improving data quality in a decentralized organization? • RQ3: How can these obstacles be overcome? Several data quality dimensions were identified and categorized as critical issues, issues, and non-issues. From the gathered data, the dimensions completeness, accuracy, and consistency were found to be critical data quality issues. The three most prevalent obstacles to improving data quality were data ownership, data standardization, and understanding the importance of data quality. To overcome these obstacles, the most important measures are creating data ownership structures, implementing data quality practices, and changing the mindset of the employees to a data-driven mindset. The generalizability of a single case study is low. However, insights and trends can be derived from the results of this thesis and used for further studies and by companies undergoing similar transformations.
APA, Harvard, Vancouver, ISO, and other styles
34

Islam, Md Zahidul. "A Cloud Based Platform for Big Data Science." Thesis, Linköpings universitet, Programvara och system, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-103700.

Full text
Abstract:
With the advent of cloud computing, resizable, scalable infrastructures for data processing are now available to everyone. Software platforms and frameworks that support data-intensive distributed applications, such as Amazon Web Services and Apache Hadoop, provide users with the necessary tools and infrastructure to work with thousands of scalable computers and process terabytes of data. However, writing scalable applications that run on top of these distributed frameworks is still a demanding and challenging task. The thesis aimed to advance the core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large data sets, collectively known as “big data”. The term “big data” in this thesis refers to large, diverse, complex, longitudinal and/or distributed data sets generated from instruments, sensors, internet transactions, email, social networks, twitter streams, and/or all digital sources available today and in the future. We introduced architectures and concepts for implementing a cloud-based infrastructure for analyzing large volumes of semi-structured and unstructured data. We built and evaluated an application prototype for collecting, organizing, processing, visualizing and analyzing data from the retail industry gathered from indoor navigation systems and social networks (Twitter, Facebook, etc.). Our finding was that developing a large-scale data analysis platform is often quite complex when the processed data is expected to grow continuously in the future. The architecture varies depending on requirements. If we want to build a data warehouse and analyze the data afterwards (batch processing), the best choices are Hadoop clusters with Pig or Hive; this architecture has been proven at Facebook and Yahoo for years. On the other hand, if the application involves real-time data analytics, the recommendation is Hadoop clusters with Storm, which has been used successfully at Twitter. After evaluating the developed prototype, we introduced a new architecture able to handle large-scale batch and real-time data. We also proposed an upgrade of the existing prototype to handle real-time indoor navigation data.
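To make the batch-processing path concrete, the sketch below shows a Hadoop Streaming style mapper and reducer in Python that count events per store from line-oriented retail logs. The CSV layout and field positions are hypothetical, and the prototype described in the abstract (which used Pig/Hive and Storm) is not reproduced here.

```python
#!/usr/bin/env python3
"""Hadoop Streaming style job: count events per store.

The mapper emits "store_id<TAB>1" per log line; the reducer receives lines
sorted by key and sums the counts. In practice the two functions live in
separate mapper.py / reducer.py scripts passed to hadoop-streaming.jar.
"""
import sys

def mapper(stream=sys.stdin):
    for line in stream:
        fields = line.rstrip("\n").split(",")   # hypothetical CSV: store_id,timestamp,event
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

def reducer(stream=sys.stdin):
    current, total = None, 0
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        if key != current and current is not None:
            print(f"{current}\t{total}")        # flush the previous store's count
            total = 0
        current = key
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if "--map" in sys.argv else reducer()
```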
APA, Harvard, Vancouver, ISO, and other styles
35

Akusok, Anton. "Extreme Learning Machines: novel extensions and application to Big Data." Diss., University of Iowa, 2016. https://ir.uiowa.edu/etd/3036.

Full text
Abstract:
Extreme Learning Machine (ELM) is a recently proposed way of training single-layer feed-forward neural networks with an explicitly given solution, which exists because the input weights and biases are generated randomly and never change. The method generally achieves performance comparable to error back-propagation, but the training time is up to five orders of magnitude smaller. Despite the random initialization, the regularization procedures explained in the thesis ensure consistently good results. While the general methodology of ELMs is well developed, the sheer speed of the method enables its atypical use in state-of-the-art techniques based on repetitive model re-training and re-evaluation. Three such techniques are explained in the third chapter: a way of visualizing high-dimensional data onto a provided fixed set of visualization points, an approach for detecting samples in a dataset with incorrect labels (mistakenly assigned, mistyped, or assigned with low confidence), and a way of computing confidence intervals for ELM predictions. All three methods prove useful and allow even more applications in the future. The ELM method is a promising basis for dealing with Big Data, because it naturally handles the problem of large data size. An adaptation of ELM to Big Data problems, and a corresponding toolbox (published and freely available), are described in chapter 4. The adaptation includes an iterative solution of ELM that satisfies limited computer-memory constraints and allows for convenient parallelization. Other tools are GPU-accelerated computations and support for a convenient storage format for huge data. The chapter also provides two real-world examples of dealing with Big Data using ELMs, which present other Big Data problems such as veracity and velocity, and solutions to them in the particular problem context.
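For readers unfamiliar with the basic ELM training step summarized above, the sketch below generates a random hidden layer and solves for the output weights with a ridge-regularized least-squares system. The tanh activation, the regularization constant, and the variable names are illustrative choices, not details of the thesis's toolbox.

```python
import numpy as np

def elm_train(X, T, n_hidden=200, reg=1e-3, seed=0):
    """Basic ELM: fixed random hidden layer, output weights by regularized
    least squares, i.e. beta = (H'H + reg*I)^-1 H'T."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights, never updated
    b = rng.standard_normal(n_hidden)                 # random biases, never updated
    H = np.tanh(X @ W + b)                            # hidden-layer activations
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Forward pass: project through the fixed random layer, apply output weights."""
    return np.tanh(X @ W + b) @ beta
```

Because only the linear system for beta has to be re-solved, re-training the model many times, as the visualization, mislabel-detection and confidence-interval techniques above require, stays cheap.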
APA, Harvard, Vancouver, ISO, and other styles
36

Dawany, Noor Tozeren Aydin. "Large-scale integration of microarray data : investigating the pathologies of cancer and infectious diseases /." Philadelphia, Pa. : Drexel University, 2010. http://hdl.handle.net/1860/3251.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Kalila, Adham. "Big data fusion to estimate driving adoption behavior and urban fuel consumption." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/119335.

Full text
Abstract:
Thesis: S.M. in Transportation, Massachusetts Institute of Technology, Department of Civil and Environmental Engineering, 2018.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 63-68).
Data from mobile phones is constantly increasing in accuracy, quantity, and ubiquity. Methods that utilize such data for transportation demand forecasting have been proposed and represent a welcome addition. We propose a framework that uses the resulting travel demand to compute fuel consumption. The model is calibrated for application across a range of car fuel efficiencies and is combined with other sources of data to produce urban fuel consumption estimates, with the city of Riyadh as an application. Targeted traffic congestion reduction strategies are compared to random traffic reduction, and the results indicate a factor-of-two improvement in fuel savings. Moreover, an agent-based innovation adoption model is used with a network of women derived from Call Detail Records to simulate when women may adopt driving after the ban on women driving is lifted in Saudi Arabia. The resulting adoption rates are combined with fuel costs from simulating empty driver trips to forecast the fuel savings potential of such a historic policy change.
by Adham Kalila.
S.M. in Transportation
APA, Harvard, Vancouver, ISO, and other styles
38

Abounia, Omran Behzad. "Application of Data Mining and Big Data Analytics in the Construction Industry." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu148069742849934.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Khalilikhah, Majid. "Traffic Sign Management: Data Integration and Analysis Methods for Mobile LiDAR and Digital Photolog Big Data." DigitalCommons@USU, 2016. https://digitalcommons.usu.edu/etd/4744.

Full text
Abstract:
This study links traffic sign visibility and legibility to quantify the effects of damage or deterioration on sign retroreflective performance. In addition, this study proposes GIS-based data integration strategies to obtain and extract climate, location, and emission data for in-service traffic signs. The proposed data integration strategy can also be used to assess all transportation infrastructures’ physical condition. Additionally, non-parametric machine learning methods are applied to analyze the combined GIS, Mobile LiDAR imaging, and digital photolog big data. The results are presented to identify the most important factors affecting sign visual condition, to predict traffic sign vandalism that obstructs critical messages to drivers, and to determine factors contributing to the temporary obstruction of the sign messages. The results of data analysis provide insight to inform transportation agencies in the development of sign management plans, to identify traffic signs with a higher likelihood of failure, and to schedule sign replacement.
APA, Harvard, Vancouver, ISO, and other styles
40

Purcaro, Michael J. "Analysis, Visualization, and Machine Learning of Epigenomic Data." eScholarship@UMMS, 2017. https://escholarship.umassmed.edu/gsbs_diss/938.

Full text
Abstract:
The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage, among others. Chromatin immunoprecipitation (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase, RNA-seq and Hi-C. These efforts have culminated in two web-based resources from our lab, Factorbook and SCREEN, for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization. While we have successfully utilized the thousands of available ENCODE ChIP-seq experiments to build the Encyclopedia and visualizers, we have also struggled with the practical and theoretical inability to assay every possible experiment on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancer locations, and demonstrate that machine learning is critical for deciphering functional regions of the genome.
APA, Harvard, Vancouver, ISO, and other styles
41

Li, Zhen. "CloudVista: a Framework for Interactive Visual Cluster Exploration of Big Data in the Cloud." Wright State University / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=wright1348204863.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Pergert, Anton, and William George. "Teoretisk undersökning om relationen mellan Big Data och ekologisk hållbarhet i tillverkande industri." Thesis, KTH, Maskinkonstruktion (Inst.), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-299636.

Full text
Abstract:
The Industrial Revolution had its beginnings in the middle of the 18th century. Today we are at the beginning of the fourth industrial revolution, also known as Industry 4.0, in which smart technologies are integrated into factories. One result of this is the collection and management of large amounts of data, which has introduced Big Data into the manufacturing industry. At the same time, the focus on ecological sustainability is growing due to increasing environmental degradation and the depletion of natural resources. Therefore, an important aspect of Industry 4.0 is to implement smart technologies that make factories more ecologically sustainable. This study consists of a theory study, in which information is compiled from relevant scientific publications. In addition, the study includes interviews with relevant companies and researchers. Based on these, the question of whether Big Data, as a smart technology, affects manufacturing companies from an ecological sustainability perspective is answered. The results show that Big Data, as a smart technology, can contribute to more energy-efficient production by collecting data and, in various ways, analysing and optimizing processes based on the collected information. However, the threshold for adopting the technology can be high, both in terms of price and knowledge. Furthermore, Big Data can accelerate the shift to a more circular economy by collecting data and making informed decisions regarding the transition to more circular and ecologically sustainable production. In addition, Big Data can facilitate and be implemented in circular services, such as machine rental, which replace linear and traditional methods where the product is purchased, used and discarded. Big Data can also be used for predictive maintenance, which reduces the use of ecological resources by collecting and analysing real-time data to make decisions, which in turn can increase the service life of equipment. This also reduces the amount of spare parts and scrap. The study therefore shows that Big Data can contribute to increased ecological sustainability in various ways.
APA, Harvard, Vancouver, ISO, and other styles
43

Kumlin, Jesper. "True operation simulation for urban rail : Energy efficiency from access to Big data." Thesis, Mälardalens högskola, Industriell ekonomi och organisation, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-44264.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Obeso, Duque Aleksandra. "Performance Prediction for Enabling Intelligent Resource Management on Big Data Processing Workflows." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-372178.

Full text
Abstract:
Mobile cloud computing offers an augmented infrastructure that allows resource-constrained devices to use remote computational resources as an enabler for highly intensive computation, thus improving the end-user experience. Being able to efficiently manage cloud elasticity represents a big challenge for dynamic, on-demand resource scaling. In this sense, the development of intelligent tools that could ease the understanding of the behavior of a highly dynamic system and detect resource bottlenecks given certain service-level constraints represents an interesting case study. In this project, a comparative study has been carried out for different distributed services, taking into account the tools that are available for load generation, benchmarking and sensing of key performance indicators. Based on that, the big data processing framework Hadoop MapReduce has been deployed as a virtualized service on top of a distributed environment. Experiments for different cluster setups using different benchmarks have been conducted on this testbed in order to collect traces of both resource-usage statistics at the infrastructure level and performance metrics at the platform level. Different machine learning approaches have been applied to the collected traces, generating prediction and classification models whose performance is then evaluated and compared. The highly accurate results, namely a normalized mean absolute error below 10.3% for the regressor and an accuracy score above 99.9% for the classifier, show the feasibility of the models generated for service performance prediction and resource bottleneck detection, which could further be used to trigger auto-scaling processes in cloud environments under dynamic loads in order to fulfill service-level requirements.
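As an illustration of the evaluation step, the sketch below fits a generic regressor to resource-usage features and computes a normalized mean absolute error. The feature set, the random-forest choice, and normalizing by the mean of the observed values are assumptions for illustration; the abstract does not specify the models or features used, and NMAE can be normalized in other ways.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def evaluate_performance_model(X, y, seed=0):
    """X: per-run resource-usage features (CPU, memory, I/O, ...);
    y: observed job runtime. Returns a fitted regressor and its normalized MAE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    nmae = mean_absolute_error(y_te, model.predict(X_te)) / np.mean(y_te)  # normalize by mean
    return model, nmae
```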
APA, Harvard, Vancouver, ISO, and other styles
45

Koseler, Kaan Tamer. "Realization of Model-Driven Engineering for Big Data: A Baseball Analytics Use Case." Miami University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=miami1524832924255132.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Saenyi, Betty. "Opportunities and challenges of Big Data Analytics in healthcare : An exploratory study on the adoption of big data analytics in the Management of Sickle Cell Anaemia." Thesis, Internationella Handelshögskolan, Högskolan i Jönköping, IHH, Informatik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-42864.

Full text
Abstract:
Background: With increasing technological advancements, healthcare providers are adopting electronic health records (EHRs) and new health information technology systems. Consequently, data from these systems is accumulating at a faster rate, creating a need for more robust ways of capturing, storing and processing the data. Big data analytics is used to extract insight from such large amounts of medical data and is increasingly becoming a valuable practice for healthcare organisations. Could these strategies be applied in disease management, especially in rare conditions like Sickle Cell Disease (SCD)? The study answers the following research questions: (1) What data management practices are used in sickle cell anaemia management? (2) What areas in the management of sickle cell anaemia could benefit from the use of big data analytics? (3) What are the challenges of applying big data analytics in the management of sickle cell anaemia? Purpose: The purpose of this research was to serve as a pre-study establishing the opportunities and challenges of applying big data analytics in the management of SCD. Method: The study adopted both deductive and inductive approaches. Data was collected through interviews based on a framework modified specifically for this study, and was then inductively analysed to answer the research questions. Conclusion: Although there is a lot of potential for big data analytics in SCD in areas like population health management, evidence-based medicine and personalised care, its adoption is not assured. This is because of a lack of interoperability between the existing systems and strenuous legal compliance processes in data acquisition.
APA, Harvard, Vancouver, ISO, and other styles
47

Taratoris, Evangelos. "A single-pass grid-based algorithm for clustering big data on spatial databases." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113168.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 79-80).
The problem of clustering multi-dimensional data has been well researched in the scientific community. It is a problem with wide scope and many applications. With the rapid growth of very large databases, traditional clustering algorithms become inefficient due to insufficient memory capacity. Grid-based algorithms try to solve this problem by dividing the space into cells and then clustering the cells. However, these algorithms also become inefficient when even the grid becomes too large to be held in memory. This thesis presents a new algorithm, SingleClus, that clusters a 2-dimensional dataset in a single pass over the data. Moreover, it optimizes the number of disk I/O operations while making modest use of main memory, and is therefore theoretically optimal in terms of performance. It modifies and improves on the Hoshen-Kopelman clustering algorithm while dealing with that algorithm's fundamental challenges in a Big Data setting.
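SingleClus itself is not reproduced here; as background, below is a minimal sketch of the classic Hoshen-Kopelman labeling pass on a boolean 2-D grid, which the abstract says the thesis modifies and improves. The in-memory numpy grid and the simple union-find bookkeeping are simplifications; the disk-aware, single-pass variant is the thesis's contribution, not this sketch.

```python
import numpy as np

def hoshen_kopelman(occupied):
    """Label 4-connected clusters of True cells in a 2-D boolean grid."""
    parent = {}  # union-find forest over provisional labels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
        return ra

    rows, cols = occupied.shape
    labels = np.zeros((rows, cols), dtype=int)
    next_label = 0
    for i in range(rows):
        for j in range(cols):
            if not occupied[i, j]:
                continue
            up = labels[i - 1, j] if i > 0 and occupied[i - 1, j] else 0
            left = labels[i, j - 1] if j > 0 and occupied[i, j - 1] else 0
            if up and left:
                labels[i, j] = union(up, left)   # both neighbours occupied: merge clusters
            elif up or left:
                labels[i, j] = find(up or left)  # extend the occupied neighbour's cluster
            else:
                next_label += 1                  # isolated so far: start a new cluster
                parent[next_label] = next_label
                labels[i, j] = next_label
    # second sweep: replace provisional labels with their root labels
    for i in range(rows):
        for j in range(cols):
            if labels[i, j]:
                labels[i, j] = find(labels[i, j])
    return labels
```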
by Evangelos Taratoris.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
48

Zhang, Liangwei. "Big Data Analytics for Fault Detection and its Application in Maintenance." Doctoral thesis, Luleå tekniska universitet, Drift, underhåll och akustik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-60423.

Full text
Abstract:
Big Data analytics has attracted intense interest recently for its attempt to extract information, knowledge and wisdom from Big Data. In industry, with the development of sensor technology and Information & Communication Technologies (ICT), reams of high-dimensional, streaming, and nonlinear data are being collected and curated to support decision-making. The detection of faults in these data is an important application in eMaintenance solutions, as it can facilitate maintenance decision-making. Early discovery of system faults may ensure the reliability and safety of industrial systems and reduce the risk of unplanned breakdowns. Complexities in the data, including high dimensionality, fast-flowing data streams, and high nonlinearity, impose stringent challenges on fault detection applications. From the data modelling perspective, high dimensionality may cause the notorious “curse of dimensionality” and lead to deterioration in the accuracy of fault detection algorithms. Fast-flowing data streams require algorithms to give real-time or near real-time responses upon the arrival of new samples. High nonlinearity requires fault detection approaches to have sufficiently expressive power and to avoid overfitting or underfitting problems. Most existing fault detection approaches work in relatively low-dimensional spaces. Theoretical studies on high-dimensional fault detection mainly focus on detecting anomalies on subspace projections. However, these models are either arbitrary in selecting subspaces or computationally intensive. To meet the requirements of fast-flowing data streams, several strategies have been proposed to adapt existing models to an online mode to make them applicable in stream data mining. But few studies have simultaneously tackled the challenges associated with high dimensionality and data streams. Existing nonlinear fault detection approaches cannot provide satisfactory performance in terms of smoothness, effectiveness, robustness and interpretability. New approaches are needed to address this issue. This research develops an Angle-based Subspace Anomaly Detection (ABSAD) approach to fault detection in high-dimensional data. The efficacy of the approach is demonstrated in analytical studies and numerical illustrations. Based on the sliding window strategy, the approach is extended to an online mode to detect faults in high-dimensional data streams. Experiments on synthetic datasets show the online extension can adapt to the time-varying behaviour of the monitored system and, hence, is applicable to dynamic fault detection. To deal with highly nonlinear data, the research proposes an Adaptive Kernel Density-based (Adaptive-KD) anomaly detection approach. Numerical illustrations show the approach’s superiority in terms of smoothness, effectiveness and robustness.
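The thesis's ABSAD method is not reproduced here. As background only, the sketch below computes a simplified, unweighted angle-based outlier score in the spirit of the angle-based outlier factor that motivates this family of approaches: a point whose difference vectors to the rest of the data span only a narrow range of angles gets a low variance and is flagged as a likely outlier. The variance-of-cosines formulation and the brute-force loop (roughly cubic in the number of points) are illustrative simplifications.

```python
import numpy as np

def angle_based_outlier_scores(X):
    """For each point, the variance of pairwise cosines between its difference
    vectors to all other points; low variance suggests an outlier."""
    n = X.shape[0]
    scores = np.empty(n)
    for i in range(n):
        diffs = np.delete(X, i, axis=0) - X[i]        # difference vectors from point i
        norms = np.linalg.norm(diffs, axis=1)
        unit = diffs[norms > 1e-12] / norms[norms > 1e-12, None]
        cos = unit @ unit.T                           # pairwise cosines of angles
        upper = cos[np.triu_indices_from(cos, k=1)]
        scores[i] = upper.var()                       # low variance -> likely outlier
    return scores
```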
APA, Harvard, Vancouver, ISO, and other styles
49

Newth, Oliver Edward. "Predicting extreme events : the role of big data in quantifying risk in structural development." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/90028.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Civil and Environmental Engineering, 2014.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 71-73).
Engineers are well placed to calculate the required resistance to natural and non-natural hazards. However, there are two main problems with the current approach. First, although hazards are one of the primary causes of catastrophic damage, and designing against risk contributes vastly to the cost of design and construction, risk is only considered late in the development process. Second, current design approaches tend to provide guidelines that do not explain the rationale behind the presented values, leaving the engineer without any true understanding of the actual risk of a hazard occurring. Data is a key aspect of accurate prediction, though its sources are often sparsely distributed and engineers rarely have the background in statistics to process it into meaningful and useful results. This thesis explores the existing approaches to designing against hazards, focussing on natural hazards such as earthquakes, and the types of existing geographic information systems (GIS) that assist in this process. A conceptual design for a hazard-related GIS is then proposed, looking at the key requirements for a system that could communicate key hazard-related data and how it could be designed and implemented. Sources of hazard-related data are then discussed. Finally, models and methodologies for interpreting hazard-related data are examined, with a schematic for how a hazard-focussed system could be structured. These look at how risk can be predicted in a transparent way that ensures the user of such a system is able to understand the hazard-related risks for a given location.
by Oliver Edward Newth.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
50

Guzun, Gheorghi. "Distributed indexing and scalable query processing for interactive big data explorations." Diss., University of Iowa, 2016. https://ir.uiowa.edu/etd/2087.

Full text
Abstract:
The past few years have brought a major surge in the volumes of collected data. More and more enterprises and research institutions find tremendous value in data analysis and exploration. Big Data analytics is used for improving customer experience, performing complex weather data integration and model prediction, and enabling personalized medicine, among many other services. Advances in technology, along with high interest in big data, can only increase the demand for data collection and mining in the years to come. As a result, and in order to keep up with the data volumes, data processing has become increasingly distributed. However, most distributed processing of large data is done in batch mode, and interactive exploration is hardly an option. To efficiently support queries over large amounts of data, appropriate indexing mechanisms must be in place. This dissertation proposes an indexing and query processing framework that can run on top of a distributed computing engine to support fast, interactive data exploration in data warehouses. Our data processing layer is built around bit-vector-based indices. This type of indexing features fast bit-wise operations and scales well to high-dimensional data. Additionally, compression can be applied to reduce the index size and thus use less memory and network communication. Our work can be divided into two areas: index compression and query processing. Two compression schemes are proposed, for sparse and dense bit-vectors. The design of these encoding methods is hardware-driven, and the query processing is optimized for the available computing hardware. Query algorithms are proposed for selection, aggregation, and other specialized queries. Query processing is supported on single machines as well as computer clusters.
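To illustrate the bit-vector indexing idea in the abstract, the sketch below builds one bit-vector per attribute value and answers a conjunctive selection query with bitwise ANDs. Python integers serve as uncompressed bitsets here; the compression schemes and distributed execution described in the dissertation are not reproduced, and the toy data is hypothetical.

```python
from collections import defaultdict

def build_bitmap_index(rows, column):
    """One bit-vector (a Python int used as a bitset) per distinct column value."""
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[column]] |= 1 << row_id
    return index

def select(indexes, predicates, n_rows):
    """Conjunctive selection: AND the matching bit-vectors, return qualifying row ids."""
    result = (1 << n_rows) - 1                      # start with all rows qualifying
    for column, value in predicates.items():
        result &= indexes[column].get(value, 0)
    return [i for i in range(n_rows) if result >> i & 1]

# usage on hypothetical toy data:
rows = [{"city": "Iowa City", "year": 2016}, {"city": "Ames", "year": 2016}]
indexes = {c: build_bitmap_index(rows, c) for c in ("city", "year")}
print(select(indexes, {"city": "Iowa City", "year": 2016}, len(rows)))  # -> [0]
```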
APA, Harvard, Vancouver, ISO, and other styles
