Dissertations / Theses on the topic 'Data engineering and data science'

To see the other types of publications on this topic, follow the link: Data engineering and data science.

Consult the top 50 dissertations / theses for your research on the topic 'Data engineering and data science.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Kanter, Max (James Max). "The data science machine : emulating human intelligence in data science endeavors." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/107031.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 87-88).
Data scientists are responsible for many tasks in the data analysis process, including formulating the question, generating features, building a model, and disseminating the results. The Data Science Machine is an automated system that emulates a human data scientist's ability to generate predictive models from raw data. In this thesis, we propose the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. We implement this algorithm and test it on 3 data science competitions that have participation from nearly 1000 data science enthusiasts. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieve 94% of the best competitor's score. Finally, we take steps towards incorporating the Data Science Machine into the data science process by implementing and evaluating an interface for users to interact with the Data Science Machine.
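The Deep Feature Synthesis idea described in this abstract can be pictured with a toy sketch: aggregation primitives applied over a child table and joined back to the parent entity. This is only a hypothetical pandas illustration of the stacking idea, not Kanter's implementation; the tables and column names are invented.

    import pandas as pd

    # Hypothetical relational data: a parent table (customers) and a child table (orders).
    customers = pd.DataFrame({"customer_id": [1, 2, 3]})
    orders = pd.DataFrame({
        "order_id": [10, 11, 12, 13],
        "customer_id": [1, 1, 2, 3],
        "amount": [20.0, 35.0, 15.0, 50.0],
    })

    # One synthesis step: apply aggregation primitives over each customer's related orders.
    aggs = orders.groupby("customer_id")["amount"].agg(["count", "mean", "max", "sum"])
    aggs = aggs.add_prefix("orders.amount.")

    # Join the synthesized features back onto the parent entity to form a feature matrix.
    feature_matrix = customers.merge(aggs, left_on="customer_id", right_index=True, how="left")
    print(feature_matrix)

Stacking further primitives on top of such aggregates (for example, aggregating columns that were themselves derived from another table) is what makes the synthesis "deep".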
by Max Kanter
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
2

Wason, Jasmin Lesley. "Automating data management in science and engineering." Thesis, University of Southampton, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.396143.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Smith, Micah J. (Micah Jacob). "Scaling collaborative open data science." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/117819.

Full text
Abstract:
Thesis: S.M. in Computer Science, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 103-107).
Large-scale, collaborative, open data science projects have the potential to address important societal problems using the tools of predictive machine learning. However, no suitable framework exists to develop such projects collaboratively and openly, at scale. In this thesis, I discuss the deficiencies of current approaches and then develop new approaches for this problem through systems, algorithms, and interfaces. A central theme is the restructuring of data science projects into scalable, fundamental units of contribution. I focus on feature engineering, structuring contributions as the creation of independent units of feature function source code. This then facilitates the integration of many submissions by diverse collaborators into a single, unified, machine learning model, where contributions can be rigorously validated and verified to ensure reproducibility and trustworthiness. I validate this concept by designing and implementing a cloud-based collaborative feature engineering platform, FeatureHub, as well as an associated discussion platform for real-time collaboration. The platform is validated through an extensive user study and modeling performance is benchmarked against data science competition results. In the process, I also collect and analyze a novel data set on the feature engineering source code submitted by crowd data science workers of varying backgrounds around the world. Within this context, I discuss paths forward for collaborative data science.
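The "independent units of feature function source code" mentioned above can be pictured as small, self-contained functions that map a raw table to one feature column. The sketch below is only illustrative (not FeatureHub's actual API; the table and column names are invented).

    import pandas as pd

    def mean_session_length(sessions: pd.DataFrame) -> pd.Series:
        """One contributor-submitted 'feature function': raw entity table in, one feature column out.
        Because each contribution is an independent, pure function, a platform can validate it in
        isolation and merge many such columns into a single feature matrix for one unified model."""
        return sessions.groupby("user_id")["duration"].mean().rename("mean_session_length")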
by Micah J. Smith.
S.M. in Computer Science
APA, Harvard, Vancouver, ISO, and other styles
4

Yang, Ying. "Interactive Data Management and Data Analysis." Thesis, State University of New York at Buffalo, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10288109.

Full text
Abstract:

Everyone today has a big data problem. Data is everywhere and comes in different formats; it may be referred to as data lakes, data streams, or data swamps. To extract knowledge or insights from the data, or to support decision-making, we need to go through a process of collecting, cleaning, managing and analyzing the data. In this process, data cleaning and data analysis are two of the most important and time-consuming components.

One common challenge in these two components is a lack of interaction. Data cleaning and data analysis are typically done as batch processes, operating on the whole dataset without any feedback. This leads to long, frustrating delays during which users have no idea if the process is effective. Without interaction, human expert effort is needed to decide which algorithms or parameters to use in the systems for these two components.

We should teach computers to talk to humans, not the other way around. This dissertation focuses on building two systems, Mimir and CIA, that help users conduct data cleaning and analysis through interaction. Mimir is a system that allows users to clean big data in a cost- and time-efficient way through interaction, a process I call on-demand ETL. Convergent inference algorithms (CIA) are a family of inference algorithms for probabilistic graphical models (PGMs) that enjoy the benefits of both exact and approximate inference algorithms through interaction.

Mimir provides a general language for users to express different data cleaning needs. It acts as a shim layer that wraps around the database, making it possible for the bulk of the ETL process to remain within a classical deterministic system. Mimir also helps users measure the quality of an analysis result and provides rankings for cleaning tasks to improve result quality in a cost-efficient manner. CIA focuses on providing user interaction throughout the process of inference in PGMs. The goal of CIA is to free users from the upfront commitment to either approximate or exact inference, and to give users more control over time/accuracy trade-offs to direct decision-making and the allocation of computation instances. This dissertation describes the Mimir and CIA frameworks to demonstrate that it is feasible to build efficient interactive data management and data analysis systems.

APA, Harvard, Vancouver, ISO, and other styles
5

Gertner, Yael. "Private data base access schemes avoiding data distribution." Thesis, Massachusetts Institute of Technology, 1997. http://hdl.handle.net/1721.1/42730.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Li, Richard D. (Richard Ding) 1978. "Web clickstream data analysis using a dimensional data warehouse." Thesis, Massachusetts Institute of Technology, 2000. http://hdl.handle.net/1721.1/86671.

Full text
Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2001.
Includes bibliographical references (leaves 83-84).
by Richard D. Li.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
7

Ramanayaka, Mudiyanselage Asanga. "Data Engineering and Failure Prediction for Hard Drive S.M.A.R.T. Data." Bowling Green State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1594957948648404.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Derksen, Timothy J. (Timothy John). "Processing of outliers and missing data in multivariate manufacturing data." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/38800.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996.
Includes bibliographical references (leaf 64).
by Timothy J. Derksen.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
9

Wang, Yi. "Data Management and Data Processing Support on Array-Based Scientific Data." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1436157356.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Chiesa, Alessandro. "Proof-carrying data." Thesis, Massachusetts Institute of Technology, 2010. http://hdl.handle.net/1721.1/61151.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.
Page 96 blank. Cataloged from PDF version of thesis.
Includes bibliographical references (p. 87-95).
The security of systems can often be expressed as ensuring that some property is maintained at every step of a distributed computation conducted by untrusted parties. Special cases include integrity of programs running on untrusted platforms, various forms of confidentiality and side-channel resilience, and domain-specific invariants. We propose a new approach, proof-carrying data (PCD), which sidesteps the threat of faults and leakage by reasoning about properties of a computation's output data, regardless of the process that produced it. In PCD, the system designer prescribes the desired properties of a computation's outputs. Corresponding proofs are attached to every message flowing through the system, and are mutually verified by the system's components. Each such proof attests that the message's data and all of its history comply with the prescribed properties. We construct a general protocol compiler that generates, propagates, and verifies such proofs of compliance, while preserving the dynamics and efficiency of the original computation. Our main technical tool is the cryptographic construction of short non-interactive arguments (computationally-sound proofs) for statements whose truth depends on "hearsay evidence": previous arguments about other statements. To this end, we attain a particularly strong proof-of-knowledge property. We realize the above, under standard cryptographic assumptions, in a model where the prover has blackbox access to some simple functionality - essentially, a signature card.
by Alessandro Chiesa.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
11

Emerson, Leslie Christopher. "A study of indexing structures for data in science and engineering." Thesis, Queen's University Belfast, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.335498.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Tsai, Po-An. "Reducing data movement in multicore chips with computation and data co-scheduling." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/99839.

Full text
Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 59-63).
Energy efficiency is the main limitation to the performance of parallel systems. Current architectures often focus on making cores more efficient. However, data movement is much more costly than basic compute operations. For example, at 28 nm, a main memory access is 100x slower and consumes 1000x the energy of a floating-point operation, and moving 64 bytes across a 16-core processor is 50x slower and consumes 20x the energy. Without a drastic reduction in data movement, memory accesses and communication costs will limit the scalability of future computing systems. Conventional hardware-only and software-only techniques miss many opportunities to reduce data movement. This thesis presents computation and data co-scheduling (CDCS), a technique that jointly performs computation and data placement to reduce both on-chip and off-chip data movement. CDCS integrates hardware and software techniques: Hardware lets software control data mapping to physically distributed caches, and software uses this support to periodically reconfigure the chip, minimizing data movement. On a simulated 64-core system, CDCS outperforms a standard last-level cache by 46% on average (up to 76%) in weighted speedup, reduces both on-chip network traffic (by 11x) and off-chip traffic (by 23%), and saves 36% of system energy.
by Po-An Tsai.
S.M.
APA, Harvard, Vancouver, ISO, and other styles
13

Eastep, Jonathan M. (Jonathan Michael). "Smart data structures : an online machine learning approach to multicore data structures." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/65967.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student submitted PDF version of thesis.
Includes bibliographical references (p. 175-180).
As multicores become prevalent, the complexity of programming is skyrocketing. One major difficulty is efficiently orchestrating collaboration among threads through shared data structures. Unfortunately, choosing and hand-tuning data structure algorithms to get good performance across a variety of machines and inputs is a herculean task to add to the fundamental difficulty of getting a parallel program correct. To help mitigate these complexities, this work develops a new class of parallel data structures called Smart Data Structures that leverage online machine learning to adapt themselves automatically. We prototype and evaluate an open source library of Smart Data Structures for common parallel programming needs and demonstrate significant improvements over the best existing algorithms under a variety of conditions. Our results indicate that learning is a promising technique for balancing and adapting to complex, time-varying tradeoffs and achieving the best performance available.
by Jonathan M. Eastep.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
14

Jones, Dalton James. "The value of data." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/115746.

Full text
Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 71-73).
Data and information are integral to the modern economic system. Advances in technology have allowed companies to both collect and utilize vast amounts of data. At times this data can be very private and collected surreptitiously. Smartphones and other devices that keep us in constant contact with the internet provide companies like Google and Facebook with a wealth of information to sell. Despite all this, there currently does not exist a systematic way to value data. In the absence of such valuations, gross economic inefficiencies are inevitable. In this thesis, we seek to model ways in which data can be bought, sold, and used fairly in an economic environment. We also develop a theory to value data in different settings. Our models and results are applied to a variety of different domains to demonstrate their efficacy. Results from game theory and mathematical programming allow us to provide fair and efficient allocations of data. This research shows that there exists an efficient and fair method with which to determine the value of information and data and to trade it fairly.
by Dalton James Jones.
S.M.
APA, Harvard, Vancouver, ISO, and other styles
15

Wang, Grant J. (Grant Jenhorn) 1979. "Algorithms for data mining." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/38315.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.
Includes bibliographical references (p. 81-89).
Data of massive size are now available in a wide variety of fields and come with great promise. In theory, these massive data sets allow data mining and exploration on a scale previously unimaginable. However, in practice, it can be difficult to apply classic data mining techniques to such massive data sets due to their sheer size. In this thesis, we study three algorithmic problems in data mining with consideration to the analysis of massive data sets. Our work is both theoretical and experimental - we design algorithms and prove guarantees for their performance and also give experimental results on real data sets. The three problems we study are: 1) finding a matrix of low rank that approximates a given matrix, 2) clustering high-dimensional points into subsets whose points lie in the same subspace, and 3) clustering objects by pairwise similarities/distances.
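For the first of the three problems, the classical baseline is the truncated SVD, which by the Eckart-Young theorem gives the best rank-k approximation in Frobenius norm. The NumPy sketch below uses a synthetic matrix; it illustrates the baseline, not the thesis's faster approximation algorithms for massive data.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 50))   # a hypothetical data matrix

    # Rank-k approximation via truncated SVD: keep the k largest singular values.
    k = 5
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Frobenius-norm error of the best rank-k approximation.
    print(np.linalg.norm(A - A_k, "fro"))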
by Grant J. Wang.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
16

Du, George J. "Interpreting and optimizing data." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/112840.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 28-29).
The goal of this research was twofold. The first goal was to use observational data to propose interventions under various constraints, without explicitly inferring a causal graph. These interventions may be optimized for a single individual within a population, or for an entire population. Under certain assumptions, we found that it is possible to provide theoretical guarantees for the intervention results when we model the data with a Gaussian process. The second goal was to map various data, including sentences and medical images, to a simple, understandable latent space, in which an intervention optimization routine may be used to find beneficial interventions. To this end, variational autoencoders were used. We found that while the Gaussian process technique was able to successfully identify interventions in both simulations and practical applications, the variational autoencoder approach did not retain enough information about the input to be competitive with current approaches for classification, such as deep CNNs.
by George J. Du.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
17

Richard, Christopher Aaron. "Data-driven logistic planning." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/38832.

Full text
Abstract:
Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996.
Includes bibliographical references (p. 109-111).
by Christopher Aaron Richard.
M.S.
APA, Harvard, Vancouver, ISO, and other styles
18

Wong, Brian T. (Brian Tak-Ho) 1978. "The Abstract Data Interface." Thesis, Massachusetts Institute of Technology, 2001. http://hdl.handle.net/1721.1/86744.

Full text
Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001.
Includes bibliographical references (leaves 107-113).
by Brian T. Wong.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
19

Lasko, Thomas A. (Thomas Anton) 1965. "Spectral anonymization of data." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/42055.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (p. 87-96).
Data anonymization is the process of conditioning a dataset such that no sensitive information can be learned about any specific individual, but valid scientific analysis can nevertheless be performed on it. It is not sufficient to simply remove identifying information because the remaining data may be enough to infer the individual source of the record (a reidentification disclosure) or to otherwise learn sensitive information about a person (a predictive disclosure). The only known way to prevent these disclosures is to remove additional information from the dataset. Dozens of anonymization methods have been proposed over the past few decades; most work by perturbing or suppressing variable values. None have been successful at simultaneously providing perfect privacy protection and allowing perfectly accurate scientific analysis. This dissertation makes the new observation that the anonymizing operations do not need to be made in the original basis of the dataset. Operating in a different, judiciously chosen basis can improve privacy protection, analytic utility, and computational efficiency. I use the term 'spectral anonymization' to refer to anonymizing in a spectral basis, such as the basis provided by the data's eigenvectors. Additionally, I propose new measures of reidentification and prediction risk that are more generally applicable and more informative than existing measures. I also propose a measure of analytic utility that assesses the preservation of the multivariate probability distribution. Finally, I propose the demanding reference standard of nonparticipation in the study to define adequate privacy protection. I give three examples of spectral anonymization in practice. The first example improves basic cell swapping from a weak algorithm to one competitive with state-of-the-art methods merely by a change of basis.
(cont.) The second example demonstrates avoiding the curse of dimensionality in microaggregation. The third describes a powerful algorithm that reduces computational disclosure risk to the same level as that of nonparticipants and preserves at least 4th order interactions in the multivariate distribution. No previously reported algorithm has achieved this combination of results.
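The basis-change idea can be sketched in a few lines of NumPy: move the records into the eigenvector (principal-component) basis, swap values there, and map back. This is only an illustrative cell-swapping sketch on made-up data, not Lasko's evaluated algorithms or his risk measures.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 4))   # hypothetical numeric microdata, one row per person

    # Move to a spectral basis: principal components of the centered data.
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                  # coordinates in the eigenvector basis

    # Anonymize in the spectral basis by independently permuting (swapping) each
    # spectral coordinate across records, then map back to the original basis.
    for j in range(scores.shape[1]):
        scores[:, j] = rng.permutation(scores[:, j])
    X_anon = scores @ Vt + mu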
by Thomas Anton Lasko.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
20

Kuncak, Viktor (Viktor Jaroslav) 1977. "Modular data structure verification." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/38533.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (p. 149-166).
This dissertation describes an approach for automatically verifying data structures, focusing on techniques for automatically proving formulas that arise in such verification. I have implemented this approach with my colleagues in a verification system called Jahob. Jahob verifies properties of Java programs with dynamically allocated data structures. Developers write Jahob specifications in classical higher-order logic (HOL); Jahob reduces the verification problem to deciding the validity of HOL formulas. I present a new method for proving HOL formulas by combining automated reasoning techniques. My method consists of 1) splitting formulas into individual HOL conjuncts, 2) soundly approximating each HOL conjunct with a formula in a more tractable fragment and 3) proving the resulting approximation using a decision procedure or a theorem prover. I present three concrete logics; for each logic I show how to use it to approximate HOL formulas, and how to decide the validity of formulas in this logic. First, I present an approximation of HOL based on a translation to first-order logic, which enables the use of existing resolution-based theorem provers. Second, I present an approximation of HOL based on field constraint analysis, a new technique that enables decision procedures for special classes of graphs (such as monadic second-order logic over trees) to be applied to arbitrary graphs.
(cont.) Third, I present an approximation using Boolean Algebra with Presburger Arithmetic (BAPA), a logic that combines reasoning about sets of elements with reasoning about cardinalities of sets. BAPA can express relationships between sizes of data structures and invariants that correlate data structure size with integer variables. I present the first implementation of a BAPA decision procedure, and establish the exact complexity bounds for BAPA and quantifier-free BAPA. Together, these techniques enabled Jahob to modularly and automatically verify data structure implementations based on singly and doubly-linked lists, trees with parent pointers, priority queues, and hash tables. In particular, Jahob was able to prove that data structure implementations satisfy their specifications, maintain key data structure invariants expressed in a rich logical notation, and never produce run-time errors such as null dereferences or out of bounds accesses.
by Viktor Kuncak.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
21

Breest, Martin, Paul Bouché, Martin Grund, Sören Haubrock, Stefan Hüttenrauch, Uwe Kylau, Anna Ploskonos, Tobias Queck, and Torben Schreiter. "Fundamentals of Service-Oriented Engineering." Universität Potsdam, 2006. http://opus.kobv.de/ubp/volltexte/2009/3380/.

Full text
Abstract:
Since 2002, keywords like service-oriented engineering, service-oriented computing, and service-oriented architecture have been widely used in research, education, and enterprises. These and related terms are often misunderstood or used incorrectly. To correct these misunderstandings, this paper provides a deeper look at the underlying concepts and their historical background, together with an overview of service-oriented architectures.
APA, Harvard, Vancouver, ISO, and other styles
22

Weston, Bron O. Duren Russell Walker Thompson Michael Wayne. "Data compression application to the MIL-STD 1553 avionics data bus." Waco, Tex. : Baylor University, 2005. http://hdl.handle.net/2104/2882.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Bao, Shunxing. "Algorithmic Enhancements to Data Colocation Grid Frameworks for Big Data Medical Image Processing." Thesis, Vanderbilt University, 2019. http://pqdtopen.proquest.com/#viewpdf?dispub=13877282.

Full text
Abstract:

Large-scale medical imaging studies to date have predominantly leveraged in-house, laboratory-based or traditional grid computing resources for their computing needs, where the applications often use hierarchical data structures (e.g., Network file system file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. Results for laboratory-based approaches reveal that performance is impeded by standard network switches, since typical processing can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. On the other hand, the grid may be costly to use due to the dedicated resources used to execute the tasks and the lack of elasticity. With the increasing availability of cloud-based big data frameworks, such as Apache Hadoop, cloud-based services for executing medical imaging studies have shown promise.

Despite this promise, our studies have revealed that existing big data frameworks exhibit different performance limitations for medical imaging applications, which calls for new algorithms that optimize their performance and suitability for medical imaging. For instance, Apache HBase's data distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). Big data medical image processing applications involving multi-stage analysis often exhibit significant variability in processing times, ranging from a few seconds to several days. Due to the sequential nature of executing the analysis stages with traditional software technologies and platforms, any errors in the pipeline are only detected at the later stages, despite the sources of errors predominantly being the highly compute-intensive first stage. This wastes precious computing resources and incurs prohibitively higher costs for re-executing the application. To address these challenges, this research proposes a framework, Hadoop & HBase for Medical Image Processing (HadoopBase-MIP), which develops a range of performance optimization algorithms and employs system behavior modeling for data storage, data access and data processing. We also describe how to build prototypes to support empirical verification of system behavior. Furthermore, we report a discovery made during the development of HadoopBase-MIP: a new type of contrast for deep brain structure enhancement in medical imaging. Finally, we show how to carry the Hadoop-based framework design forward into a commercial big data / high-performance computing cluster with a cheap, scalable and geographically distributed file system.

APA, Harvard, Vancouver, ISO, and other styles
24

Ma, Yao 1975. "Data warehousing, OLAP, and data mining : an integrated strategy for use at FAA." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/47590.

Full text
Abstract:
Thesis (S.B. and M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.
Includes bibliographical references (leaves 76-77).
by Yao Ma.
S.B. and M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
25

Kapoor, Rakesh R. (Rakesh Rameshchandra). "Thermodynamic data from diffusion couples." Thesis, Massachusetts Institute of Technology, 1989. http://hdl.handle.net/1721.1/14217.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Shang, Zeyuan. "Democratizing data science through interactive curation of ML pipelines." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/128304.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2020
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 63-66).
Statistical knowledge and domain expertise are key to extracting actionable insights from data, yet such skills rarely coexist. In Machine Learning, high-quality results are only attainable via mindful data preprocessing, hyperparameter tuning and model selection. Domain experts are often overwhelmed by such complexity, which de facto inhibits a wider adoption of ML techniques in other fields. Existing libraries that claim to solve this problem still require well-trained practitioners. Those frameworks involve heavy data preparation steps and are often too slow for interactive feedback from the user, severely limiting the scope of such systems. In this work we present Alpine Meadow, a first Interactive Automated Machine Learning tool. What makes the system unique is not only the focus on interactivity, but also the combined systemic and algorithmic design approach: on one hand we leverage ideas from query optimization, on the other we devise novel selection and pruning strategies combining cost-based Multi-Armed Bandits and Bayesian Optimization. We evaluate the system on over 300 datasets and compare against other AutoML tools, including the current NIPS winner, as well as expert solutions. Not only does Alpine Meadow significantly outperform the other AutoML systems while, in contrast to them, providing interactive latencies; it also outperforms expert solutions in 80% of cases on data sets it has never seen before.
by Zeyuan Shang.
S.M.
S.M. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
27

Karamalis, Constantinos. "Data perturbation analyses for linear programming." Thesis, University of Ottawa (Canada), 1994. http://hdl.handle.net/10393/6709.

Full text
Abstract:
This thesis focuses on several aspects of data perturbation for Linear Programming. Classical questions of degeneracy and post-optimal analysis are given a unified presentation, in view of new interior point methods of linear programming. The performance of these methods is compared to the simplex algorithm; interior point methods are shown to alleviate some difficulties of representation and solution of linear programs. An affine scaling algorithm is implemented in conjunction with a simple rounding heuristic to assess the benefit of interior point trajectories in providing approximate solutions to linear integer programs.
APA, Harvard, Vancouver, ISO, and other styles
28

Jahandideh, Mohammad Taghi. "Option pricing for infinite variance data." Thesis, University of Ottawa (Canada), 2004. http://hdl.handle.net/10393/26665.

Full text
Abstract:
Infinite variance distributions are among the competing models used to explain the non-normality of stock price changes (Mandelbrot, 1963; Fama, 1965; Mandelbrot and Taylor, 1967; Rachev and Samorodnitsky, 1993). We investigate the asymptotic option price formula in infinite variance setting for both independent and correlated data using point processes. As we shall see the application of point process models can also lead us to investigate a more general option price formula. We also apply a recursion technique to quantify various characteristics of the resulting formulas. It shows that such formulas, and even their approximations, may be difficult to apply in practice. A nonparametric bootstrap method is proposed as one alternative approach and its asymptotic consistency is established under a resampling scheme of m = o(n). Some empirical evidence is provided showing the method works in principle, although large sample sizes appear to be needed for accuracy. This method is also illustrated using publicly available financial data.
APA, Harvard, Vancouver, ISO, and other styles
29

Cao, Ronald. "Multivariate analysis of manufacturing data." Thesis, Massachusetts Institute of Technology, 1997. http://hdl.handle.net/1721.1/42746.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Matusik, Wojciech 1973. "A data-driven reflectance model." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/87454.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
"September 2003."
Includes bibliographical references (leaves 112-115).
I present a data-driven model for isotropic bidirectional reflectance distribution functions (BRDFs) based on acquired reflectance data. Instead of using analytic reflectance models, each BRDF is represented as a dense set of measurements. This representation allows interpolation and extrapolation in the space of acquired BRDFs to create new BRDFs. Each acquired BRDF is treated as a single high-dimensional vector taken from the space of all possible BRDFs. Both linear (subspace) and non-linear (manifold) dimensionality reduction tools are applied in an effort to discover a lower-dimensional representation that characterizes the acquired BRDFs. To complete the model, users are provided with the means for defining perceptually meaningful parametrizations that allow them to navigate in the reduced-dimension BRDF space. On the low-dimensional manifold, movement along these directions produces novel, but valid, BRDFs. By analyzing a large collection of reflectance data, I also derive two novel reflectance sampling procedures that require fewer total measurements than standard uniform sampling approaches. Using densely sampled measurements the general surface reflectance function is analyzed to determine the local signal variation at each point in the function's domain. Wavelet analysis is used to derive a common basis for all of the acquired reflectance functions, as well as a non-uniform sampling pattern that corresponds to all non-zero wavelet coefficients. Second, I show that the reflectance of an arbitrary material can be represented as a linear combination of the surface reflectance functions. Furthermore, this analysis specifies a reduced set of sampling points that permits the robust estimation of the coefficients of this linear combination.
(cont.) These procedures dramatically shorten the acquisition time for isotropic reflectance measurements.
by Wojciech Matusik.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
31

Zhang, Chiyuan Ph D. Massachusetts Institute of Technology. "Deep learning and structured data." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/115643.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 135-150).
In recent years deep learning has witnessed successful applications in many different domains such as visual object recognition, detection and segmentation, automatic speech recognition, natural language processing, and reinforcement learning. In this thesis, we will investigate deep learning from a spectrum of different perspectives. First of all, we will study the question of generalization, which is one of the most fundamental notions in machine learning theory. We will show how, in the regime of deep learning, the characterization of generalization becomes different from the conventional way, and propose alternative ways to approach it. Moving from theory to more practical perspectives, we will show two different applications of deep learning. One originates from a real-world problem of automatic geophysical feature detection from seismic recordings to help oil & gas exploration; the other is motivated by computational neuroscientific modeling and study of the human auditory system. More specifically, we will show how deep learning can be adapted to play nicely with the unique structures associated with problems from different domains. Lastly, we move to the computer system design perspective, and present our efforts in building better deep learning systems to allow efficient and flexible computation in both academic and industrial worlds.
by Chiyuan Zhang.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
32

Price, Michael Ph D. (Michael R. ). Massachusetts Institute of Technology. "Asynchronous data-dependent jitter compensation." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/52771.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (p. 95-96).
Data-dependent jitter (DDJ) caused by lossy channels is a limiting factor in the bit rates that can be achieved reliably over serial links. This thesis explains the causes of DDJ and existing equalization techniques, then develops an asynchronous (clock-agnostic) architecture for DDJ compensation. The compensation circuit alters the transition times of a digital signal to cancel the expected channel-induced delays. It is designed for a 0.35 µm BiCMOS process with a 240 x 140 µm footprint and typically consumes 3.4 mA, a small fraction of the current used in a typical transmitter. Extensive simulations demonstrate that the circuit has the potential to reduce channel-induced DDJ by at least 50% at bit rates of 6.25 Gb/s and 10 Gb/s.
by Michael Price.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
33

Barr, Kenneth C. (Kenneth Charles) 1978. "Energy aware lossless data compression." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/87316.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Szummer, Marcin Olof. "Learning from partially labeled data." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/29273.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002.
Includes bibliographical references (p. 75-81).
Classification with partially labeled data involves learning from a few labeled examples as well as a large number of unlabeled examples, and represents a blend of supervised and unsupervised learning. Unlabeled examples provide information about the input domain distribution, but only the labeled examples indicate the actual classification task. The key question is how to improve classification accuracy by linking aspects of the input distribution P(x) to the conditional output distribution P(y|x) of the classifier. This thesis presents three approaches to the problem, starting with a kernel classifier that can be interpreted as a discriminative kernel density estimator and is trained via the EM algorithm or via margin-based criteria. Secondly, we employ a Markov random walk representation that exploits clusters and low-dimensional structure in the data in a robust and probabilistic manner. Thirdly, we introduce information regularization, a non-parametric technique based on minimizing information about labels over regions covering the domain. Information regularization provides a direct and principled way of linking P(x) to P(y|x), and remains tractable for continuous P(x). The partially labeled problem arises in many applications where it is easy to collect unlabeled examples, but labor-intensive to classify the examples. The thesis demonstrates that the approaches require very few labeled examples for high classification accuracy on text and image-classification tasks.
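The Markov random walk idea, spreading label information from the few labeled points to the many unlabeled ones along a data-dependent transition matrix, can be sketched as a simple label-propagation routine. The following is an illustrative approximation under assumed Gaussian affinities, not Szummer's exact estimator.

    import numpy as np

    def propagate_labels(X, y, n_steps=10, sigma=1.0):
        """Toy semi-supervised labeling via a Markov random walk.

        X: (n, d) points; y: (n,) labels in {0, 1} with -1 marking unlabeled points.
        Illustrative sketch only, not the thesis's estimator.
        """
        # Row-normalized Gaussian affinities define one-step transition probabilities.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * sigma ** 2))
        P = W / W.sum(axis=1, keepdims=True)

        labeled = y != -1
        f = np.zeros((len(y), 2))
        f[labeled, y[labeled]] = 1.0          # one-hot class mass at labeled points

        for _ in range(n_steps):
            f = P @ f                          # diffuse label mass along the walk
            f[labeled] = 0.0
            f[labeled, y[labeled]] = 1.0       # clamp the labeled examples
        return f.argmax(axis=1)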
by Marcin Olof Szummer.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
35

Billing, Jeffrey J. (Jeffrey Joel) 1979. "Learning classifiers from medical data." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/8068.

Full text
Abstract:
Thesis (M.Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002.
Includes bibliographical references (leaf 32).
The goal of this thesis was to use machine-learning techniques to discover classifiers from a database of medical data. Through the use of two software programs, C5.0 and SVMLight, we analyzed a database of 150 patients who had been operated on by Dr. David Rattner of the Massachusetts General Hospital. C5.0 is an algorithm that learns decision trees from data, while SVMLight learns support vector machines from the data. With both techniques we performed cross-validation analysis, and both failed to produce acceptable error rates. The end result of the research was that no classifiers could be found which performed well under cross-validation. Nonetheless, this paper provides a thorough examination of the different issues that arise during the analysis of medical data, describes the techniques that were used, and discusses the issues with the data that affected the performance of these techniques.
by Jeffrey J. Billing.
M.Eng. and S.B.
APA, Harvard, Vancouver, ISO, and other styles
36

Benson, Edward 1983. "A data aware web architecture." Thesis, Massachusetts Institute of Technology, 2010. http://hdl.handle.net/1721.1/60156.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.
Includes bibliographical references (p. 86-89).
This thesis describes a client-server toolkit called Sync Kit that demonstrates how client-side database storage can improve the performance of data-intensive websites. Sync Kit is designed to make use of the embedded relational database defined in the upcoming HTML5 standard to offload some data storage and processing from a web server onto the web browsers to which it serves content. Sync Kit provides various strategies for synchronizing relational database tables between the browser and the web server, along with a client-side template library so that portions of web applications may be executed client-side. Unlike prior work in this area, Sync Kit persists both templates and data in the browser across web sessions, increasing the number of concurrent connections a server can handle by up to a factor of four versus that of a traditional server-only web stack and a factor of three versus a recent template caching approach.
by Edward Benson.
S.M.
APA, Harvard, Vancouver, ISO, and other styles
37

Dasari, Vivek. "Platform for spatial molecular data." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/107103.

Full text
Abstract:
Thesis: M. Eng. in Computer Science and Molecular Biology, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 49-50).
I designed and implemented a comprehensive platform for storing, analyzing, visualizing, and interacting with spatial molecular data. With the advent of high throughput in situ sequencing methods, such as fluorescent in situ sequencing (FISSEQ), the need for a platform to organize spatial molecular data has become paramount. The platform is divided into seven services: raw data handling, a spatial coordinate system, an analysis service, an image service, a molecular data service, a spatial data service and a visualization service. Together, these services compose a modular system for organizing the next generation of spatial molecular data.
by Vivek Dasari.
M. Eng. in Computer Science and Molecular Biology
APA, Harvard, Vancouver, ISO, and other styles
38

Turmukhametova, Aizana. "Diverse sampling of streaming data." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/85230.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 49-51).
This thesis addresses the problem of diverse sampling as a dispersion problem and proposes solutions that are optimized for large streaming data. Finding the optimal solution to the dispersion problem is NP-hard. Therefore, existing and proposed solutions are approximation algorithms. This work evaluates the performance of different algorithms in practice and compares them to the theoretical guarantees.
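A standard offline approximation for such max-min dispersion objectives is the greedy farthest-point heuristic. The sketch below illustrates the flavor of the problem on in-memory data; the thesis's streaming-optimized variants are not reproduced here.

    import numpy as np

    def greedy_dispersion_sample(X, k):
        """Greedy farthest-point heuristic for max-min dispersion sampling.

        Repeatedly picks the point farthest from the set chosen so far; a classic
        offline approximation, not the streaming algorithms studied in the thesis.
        """
        chosen = [0]                                   # start from an arbitrary point
        dists = np.linalg.norm(X - X[0], axis=1)       # distance to the chosen set
        for _ in range(1, k):
            nxt = int(np.argmax(dists))
            chosen.append(nxt)
            dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
        return chosen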
by Aizana Turmukhametova.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
39

Alsheikh, Sami Thabet. "Automated understanding of data visualizations." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/112830.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 63-65).
When a person views a data visualization (graph, chart, infographic, etc.), they read the text and process the images to quickly understand the communicated message. This research works toward emulating this ability in computers. In pursuing this goal, we have explored three primary research objectives: 1) extracting and ranking the most relevant keywords in a data visualization 2) predicting a sensible topic and multiple subtopics for a data visualization, and 3) extracting relevant pictographs from a data visualization. For the first task, we create an automatic text extraction and ranking system which we evaluate on 202 MASSVIS data visualizations. For the last two objectives, we curate a more diverse and complex dataset, Visually. We devise a computational approach that automatically outputs textual and visual elements predicted representative of the data visualization content. Concretely, from the curated Visually dataset of 29K large infographic images sampled across 26 categories and 391 tags, we present an automated two step approach: first, we use extracted text to predict the text tags indicative of the infographic content, and second, we use these predicted text tags to localize the most diagnostic visual elements (what we have called "visual tags"). We report performances on a categorization and multi-label tag prediction problem and compare the results to human annotations. Our results show promise for automated human-like understanding of data visualizations.
by Sami Thabet Alsheikh.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
40

Xiao, Katharine (Katharine J. ). "Towards automatically linking data elements." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113450.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 91-92).
When presented with a new dataset, human data scientists explore it in order to identify salient properties of the data elements, identify relationships between entities, and write processing software that makes use of those relationships accordingly. While there has been progress made on automatically processing the data to generate features or models, most automation systems rely on receiving a data model that has all the meta information about the data, including salient properties and relationships. In this thesis, we present a first version of our system, called ADEL-Automatic Data Elements Linking. Given a collection of files, this system generates a relational data schema and identifies other salient properties. It detects the type of each data field, which describes not only the programmatic data type but also the context in which the data originated, through a method called Type Detection. For each file, it identifies the field that uniquely describes each row in it, also known as a Primary Key. Then, it discovers relationships between different data entities with Relationship Discovery, and discovers any implicit constraints in the data through Hard Constraint Discovery. We posit two out of these four problems as learning problems. To evaluate our algorithms, we compare the results of each to a set of manual annotations. For Type Detection, we saw a max error of 7%, with an average error of 2.2% across all datasets. For Primary Key Detection, we classified all existing primary keys correctly, and had one false positive across five datasets. For Relationship Discovery, we saw an average error of 5.6%. (Our results are limited by the small number of manual annotations we currently possess.) We then feed the output of our system into existing semi-automated data science software systems - the Deep Feature Synthesis (DFS) algorithm, which generates features for predictive models, and the Synthetic Data Vault (SDV), which generates a hierarchical graphical model. When ADEL's data annotations are fed into DFS, it produces similar or higher predictive accuracy in 3/4 problems, and when they are provided to SDV, it is able to generate synthetic data with no constraint violations.
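As a flavor of the Primary Key Detection step described above, a naive rule-based sketch simply checks which columns are non-null and unique. ADEL itself casts detection as a learning problem; the data and names below are hypothetical.

    import pandas as pd

    def candidate_primary_keys(df: pd.DataFrame):
        """Toy heuristic: a column qualifies as a primary key candidate if it is
        fully populated and uniquely identifies every row. Rule-based sketch only."""
        return [
            col for col in df.columns
            if df[col].notna().all() and df[col].is_unique
        ]

    # Hypothetical example: only user_id uniquely identifies rows.
    df = pd.DataFrame({"user_id": [1, 2, 3], "age": [34, 34, 21]})
    print(candidate_primary_keys(df))   # ['user_id']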
by Katharine Xiao.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
41

Aron, Yotam. "Information privacy for linked data." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/85215.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 77-79).
As data mining over massive amounts of linked data becomes more and more prevalent in research applications, information privacy becomes a more important issue. This is especially true in the biological and medical fields, where information sensitivity is high. Previous experience has shown that simple anonymization techniques, such as removing an individual's name from a data set, are inadequate to fully protect the data's participants. While strong privacy guarantees have been studied for relational databases, these are virtually non-existent for graph-structured linked data. This line of research is important, however, since the aggregation of data across different web sources may lead to privacy leaks. The ontological structure of linked data especially aids these attacks on privacy. The purpose of this thesis is two-fold. The first is to investigate differential privacy, a strong privacy guarantee, and how to construct differentially-private mechanisms for linked data. The second involves the design and implementation of the SPARQL Privacy Insurance Module (SPIM). Using a combination of well-studied techniques, such as authentication and access control, and the mechanisms developed to maintain differential privacy over linked data, it attempts to limit privacy hazards for SPARQL queries. By using these privacy-preservation techniques, data owners may be more willing to share their data sets with other researchers without the fear that it will be misused. Consequently, we can expect greater sharing of information, which will foster collaboration and improve the types of data that researchers can have access to.
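Differential privacy, the guarantee SPIM builds on, is typically achieved for numeric queries with the Laplace mechanism: add noise scaled to the query's sensitivity divided by the privacy budget epsilon. A minimal sketch follows, independent of SPARQL and of SPIM's access-control machinery.

    import numpy as np

    def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng()):
        """Release true_value with Laplace noise of scale sensitivity/epsilon,
        the standard construction for epsilon-differential privacy."""
        return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Example: privately release a count query (sensitivity 1) with epsilon = 0.5.
    print(laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5))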
by Yotam Aron.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
42

Deng, Mo Ph D. Massachusetts Institute of Technology. "On compression of encrypted data." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106100.

Full text
Abstract:
Thesis: S.M. in Electrical Engineering, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 93-96).
In this thesis, I took advantage of a model-free compression architecture, in which the encoder only makes decisions about coding and leaves it to the decoder to apply knowledge of the source for decoding, to attack the problem of compressing encrypted data. Results for compressing different sources encrypted by different classes of ciphers are shown and analyzed. Moreover, we generalize the problem from encryption schemes to operations, or data-processing techniques. We try to discover key properties an operation should have in order to enable good post-operation compression performance.
by Mo Deng.
S.M. in Electrical Engineering
APA, Harvard, Vancouver, ISO, and other styles
43

Sarkar, Tuhin. "Learning structure from unstructured data." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/127005.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May, 2020
Cataloged from the official PDF of thesis.
Includes bibliographical references (pages 277-285).
This thesis develops statistical tools and algorithmic techniques for non-asymptotic system identification of dynamical systems from noisy input-output data. Specifically, we address the question: "For a fixed length of noisy data generated by an unknown model, what is the best approximation that can be estimated?"; this is in contrast to traditional system identification, which answers the question of estimating the unknown model when the data length tends to infinity. The importance of such analyses and tools cannot be overstated in applications such as reinforcement learning, where a popular design principle is system identification for control. Typically, in such settings we are presented with two problems: first, we are given access only to a finite noisy data set; and second, the hidden state dimension or model order is unknown. The first problem limits our ability to comment on the finite time performance of estimation algorithms; and the second problem prevents appropriate parametrizations for model identification. The goal of this thesis is to address these issues for a large class of dynamical systems. The premise of our approach relies on the existence of suitable low order approximations of the true model that can be constructed from finite, albeit noisy, data. Since the true model order is a priori unknown, we simply estimate low order approximations of this model from data. The order of these approximations grows as we accumulate more data. By such a method, we construct consistent finite time estimators of the underlying data generating model. This principle of constructing low order estimates directly from data is different from the status quo of constructing the largest possible model and then performing a reduction procedure to obtain estimates. We show that in many cases our method outperforms existing algorithms in finite time.
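The finite-data question posed above is usually studied first for the simplest case: estimating the transition matrix of a linear system x_{t+1} = A x_t + noise by ordinary least squares from one trajectory. The sketch below shows that baseline estimator on simulated data; it is not Sarkar's low-order approximation scheme, and the system here is invented.

    import numpy as np

    rng = np.random.default_rng(2)

    # Simulate a hypothetical linear system x_{t+1} = A x_t + noise.
    A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
    T, n = 500, 2
    X = np.zeros((T + 1, n))
    for t in range(T):
        X[t + 1] = A_true @ X[t] + 0.1 * rng.standard_normal(n)

    # Ordinary least-squares estimate of A from the finite trajectory:
    # solve min over B of ||X[1:] - X[:-1] B||_F, with A_hat = B^T.
    B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
    A_hat = B.T
    print(np.linalg.norm(A_hat - A_true))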
by Tuhin Sarkar.
Ph. D.
Ph.D. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
44

Fok, Lordique (Lordique S.). "Techniques for structured data discovery." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/121671.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 63-64).
The discovery of structured data, or data that is tagged by key-value pairs, is a problem that can be subdivided into two issues: how best to structure information architecture and user interaction for discovery; and how to intelligently display data in a way that optimizes the discovery of "useful" (i.e., relevant and helpful for a user's current use case) data. In this thesis, I investigate multiple methods of addressing both issues, and the results of evaluating these methods qualitatively and quantitatively. Specifically, I implement and evaluate: a novel interface design which combines different aspects of existing interfaces, two methods of diversifying data subsets given a search query, three methods of incorporating relevance in data subsets given a search query and information about the user's historic queries, a novel method of visualizing structured data, and two methods of inducing hierarchy on structured data in the presence of a partial data schema. These implementations and evaluations are shown to be effective in structuring information architecture and user interaction for structured data discovery, but only partially effective in intelligently displaying data to optimize discovery of useful structured data.
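As one concrete way to diversify a result subset for a search query, the sketch below implements maximal marginal relevance over item embeddings; MMR is used here purely as an illustration and is not necessarily one of the two diversification methods evaluated in the thesis.

```python
import numpy as np

def mmr_select(query_vec, item_vecs, k=10, lam=0.7):
    """Greedy maximal marginal relevance: trade relevance to the query against
    similarity to items already selected (lam=1 ignores diversity entirely)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    selected, remaining = [], list(range(len(item_vecs)))
    while remaining and len(selected) < k:
        def mmr(i):
            relevance = cos(query_vec, item_vecs[i])
            redundancy = max((cos(item_vecs[i], item_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```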
by Lordique Fok.
M. Eng.
M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
45

Cheelangi, Madhusudan. "Result Distribution in Big Data Systems." Thesis, University of California, Irvine, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=1539891.

Full text
Abstract:

We are building a Big Data Management System (BDMS) called AsterixDB at UCI. Since AsterixDB is designed to operate on large volumes of data, the results of its queries can be potentially very large, and AsterixDB is also designed to operate under highly concurrent workloads. As a result, we need a specialized mechanism to manage these large volumes of query results and deliver them to the clients. In this thesis, we present an architecture and an implementation of a new result distribution framework that is capable of handling large volumes of results under highly concurrent workloads. We present the various components of this result distribution framework and show how they interact with each other to manage large volumes of query results and deliver them to clients. We also discuss various result distribution policies that are possible with our framework and compare their performance through experiments.

We have implemented a REST-like HTTP client interface on top of the result distribution framework to allow clients to submit queries and obtain their results. This client interface provides two modes for clients to choose from to read their query results: synchronous mode and asynchronous mode. In synchronous mode, query results are delivered to a client as a direct response to its query within the same request-response cycle. In asynchronous mode, a query handle is returned instead to the client as a response to its query. The client can store the handle and send another request later, including the query handle, to read the result for the query whenever it wants. The architectural support for these two modes is also described in this thesis. We believe that the result distribution framework, combined with this client interface, successfully meets the result management demands of AsterixDB.
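A hypothetical client-side sketch of the two read modes described above, written against an invented REST endpoint; the URLs, parameters, and response fields are assumptions for illustration, not the actual AsterixDB HTTP API.

```python
import time
import requests

BASE = "http://localhost:19002"  # hypothetical endpoint

def query_sync(statement: str) -> dict:
    """Synchronous mode: the result arrives in the same request-response cycle."""
    r = requests.post(f"{BASE}/query", data={"statement": statement, "mode": "synchronous"})
    r.raise_for_status()
    return r.json()

def query_async(statement: str, poll_interval: float = 1.0) -> dict:
    """Asynchronous mode: a handle comes back immediately; the client polls with it later."""
    r = requests.post(f"{BASE}/query", data={"statement": statement, "mode": "asynchronous"})
    r.raise_for_status()
    handle = r.json()["handle"]          # assumed response field
    while True:
        res = requests.get(f"{BASE}/query/result", params={"handle": handle})
        if res.status_code == 200:       # assumed: 200 once the result is ready
            return res.json()
        time.sleep(poll_interval)
```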

APA, Harvard, Vancouver, ISO, and other styles
46

Tibbetts, Kevin (Kevin Joseph). "Data mining for structure type prediction." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/34413.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Materials Science and Engineering, 2004.
Includes bibliographical references (p. 41-42).
Determining the stable structure types of an alloy is critical to determining many properties of that material. This can be done through experiment or computation. Both methods can be expensive and time consuming. Computational methods require energy calculations of hundreds of structure types. Computation time would be greatly improved if this large number of possible structure types were reduced. A method is discussed here to predict the stable structure types for an alloy based on compiled data. This would include experimentally observed stable structure types and calculated energies of structure types. In this paper I will describe the state of this technology. This will include an overview of past and current work. Curtarolo et al. showed a factor of three improvement in the number of calculations required to determine a given percentage of the ground state structure types for an alloy system by using correlations among a database of over 6000 calculated energies. I will show correlations among experimentally determined stable structure types appearing in the same alloy system through statistics computed from the Pauling File Inorganic Materials Database, Binaries edition. I will compare a method to predict stable structure types based on correlations among pairs of structure types that appear in the same alloy system with a method based simply on the frequency of occurrence of each structure type. I will show a factor of two improvement in the number of calculations required to determine the ground state structure types between these two methods. This paper will examine the potential market value for a software tool used to predict likely stable structure types. A timeline for introduction of this product and an analysis of the market for such a tool will be included. There is no established market for structure type prediction software, but the market will be similar to that of materials database software and energy calculation software. The potential market is small, but the production and maintenance costs are also small. These small costs, combined with the potential of this tool to improve greatly over time, make this a potentially promising investment. These methods are still in development. The key to the value of this tool lies in the accuracy of the prediction methods developed over the next few years.
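The sketch below contrasts the two ranking ideas compared in the abstract, a frequency baseline versus co-occurrence of structure-type pairs within the same alloy system, on toy data; it illustrates the approach only and is not the author's implementation.

```python
from collections import Counter
from itertools import combinations

def rank_by_frequency(alloy_systems):
    """Baseline: rank structure types by how often they occur across all systems."""
    counts = Counter(st for system in alloy_systems for st in system)
    return [st for st, _ in counts.most_common()]

def rank_by_cooccurrence(alloy_systems, observed):
    """Rank candidate structure types by how often they appear in the same alloy
    system as types already observed in the system of interest."""
    pair_counts = Counter()
    for system in alloy_systems:
        for a, b in combinations(sorted(set(system)), 2):
            pair_counts[(a, b)] += 1

    def score(candidate):
        return sum(pair_counts[tuple(sorted((candidate, o)))] for o in observed)

    candidates = {st for system in alloy_systems for st in system} - set(observed)
    return sorted(candidates, key=score, reverse=True)

# Toy usage: each inner list is the set of stable structure types in one alloy system.
systems = [["B2", "L1_2"], ["B2", "D0_3"], ["L1_2", "D0_19"], ["B2", "L1_2", "D0_3"]]
print(rank_by_frequency(systems))
print(rank_by_cooccurrence(systems, observed=["B2"]))
```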
by Kevin Tibbetts.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
47

Kim, Edward Soo. "Data-mining natural language materials syntheses." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/122075.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Materials Science and Engineering, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references.
Discovering, designing, and developing a novel material is an arduous task, involving countless hours of human effort and ingenuity. While some aspects of this process have been vastly accelerated by the advent of first-principles-based computational techniques and high throughput experimental methods, a vast ocean of untapped historical knowledge lies dormant in the scientific literature. Namely, the precise methods by which many inorganic compounds are synthesized are recorded only as text within journal articles. This thesis aims to realize the potential of this data for informing the syntheses of inorganic materials through the use of data-mining algorithms. Critically, the methods used and produced in this thesis are fully automated, thus maximizing the impact for accelerated synthesis planning by human researchers.
There are three primary objectives of this thesis: 1) aggregate and codify synthesis knowledge contained within scientific literature, 2) identify synthesis "driving factors" for different synthesis outcomes (e.g., phase selection) and 3) autonomously learn synthesis hypotheses from the literature and extend these hypotheses to predicted syntheses for novel materials. Towards the first goal of this thesis, a pipeline of algorithms is developed in order to extract and codify materials synthesis information from journal articles into a structured, machine readable format, analogous to existing databases for materials structures and properties. To efficiently guide the extraction of materials data, this pipeline leverages domain knowledge regarding the allowable relations between different types of information (e.g., concentrations often correspond to solutions).
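For a flavour of the kind of structured record such a pipeline produces, the toy sketch below pulls temperatures, concentrations, and durations out of a synthesis sentence with regular expressions; a real extraction pipeline would rely on trained sequence-labelling models rather than regexes, and the sentence is invented.

```python
import re

sentence = "TiO2 was treated in 10 M NaOH at 150 °C for 24 h."

# Build a minimal machine-readable synthesis record from the text.
record = {
    "temperatures_C": [float(t) for t in re.findall(r"(\d+(?:\.\d+)?)\s*°C", sentence)],
    "concentrations_M": [float(c) for c in re.findall(r"(\d+(?:\.\d+)?)\s*M\b", sentence)],
    "times_h": [float(h) for h in re.findall(r"(\d+(?:\.\d+)?)\s*h\b", sentence)],
}
print(record)  # {'temperatures_C': [150.0], 'concentrations_M': [10.0], 'times_h': [24.0]}
```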
Both unsupervised and supervised machine learning algorithms are also used to rapidly extract synthesis information from the literature. To examine the autonomous learning of driving factors for morphology selection during hydrothermal syntheses, TiO₂ nanotube formation is found to be correlated with NaOH concentrations and reaction temperatures, using models that are given no internal chemistry knowledge. Additionally, the capacity for transfer learning is shown by predicting phase symmetry in materials systems unseen by models during training, outperforming heuristic, physically-motivated baseline strategies, again with chemistry-agnostic models. These results suggest that synthesis parameters possess some intrinsic capability for predicting synthesis outcomes. The nature of this linkage between synthesis parameters and synthesis outcomes is then further explored by performing virtual synthesis parameter screening using generative models.
Deep neural networks (variational autoencoders) are trained to learn low-dimensional representations of synthesis routes on augmented datasets, created by aggregating synthesis information across materials with high structural similarity. This technique is validated by predicting ion-mediated polymorph selection effects in MnO₂, using only data from the literature (i.e., without knowledge of competing free energies). This method of synthesis parameter screening is then applied to suggest a new hypothesis for solvent-driven formation of the rare TiO₂ phase, brookite. To extend the capability of synthesis planning with literature-based generative models, a sequence-based conditional variational autoencoder (CVAE) neural network is developed. The CVAE allows a materials scientist to query the model for synthesis suggestions of arbitrary materials, including those that the model has not observed before.
In a demonstrative experiment, the CVAE suggests the correct precursors for literature-reported syntheses of two perovskite materials using training data published more than a decade prior to the target syntheses. Thus, the CVAE is used as an additional materials synthesis screening utility that is complementary to techniques driven by density functional theory calculations. Finally, this thesis provides a broad commentary on the status quo for the reporting of written materials synthesis methods, and suggests a new format which improves both human and machine readability. The thesis concludes with comments on promising future directions which may build upon the work described in this document.
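A minimal conditional-VAE skeleton in PyTorch showing the encode/reparameterise/decode structure such a model uses; the dimensions, layers, and loss are illustrative assumptions, not the thesis's sequence-based CVAE.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Encode a synthesis-route vector x together with a target-material
    condition c into a latent z, then decode (z, c) back into a route."""
    def __init__(self, x_dim=64, c_dim=16, z_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

def cvae_loss(x_hat, x, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld

# To propose a route for an unseen material, sample z ~ N(0, I) and decode it
# together with that material's condition vector c.
```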
by Edward Soo Kim.
Ph. D.
Ph.D. Massachusetts Institute of Technology, Department of Materials Science and Engineering
APA, Harvard, Vancouver, ISO, and other styles
48

Anderson, Alec W. (Alec Wayne). "Deep Mining : scaling Bayesian auto-tuning of data science pipelines." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/119509.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 105-108).
Within the automated machine learning movement, hyperparameter optimization has emerged as a particular focus. Researchers have introduced various search algorithms and open-source systems in order to automatically explore the hyperparameter space of machine learning methods. While these approaches have been effective, they also display significant shortcomings that limit their applicability to realistic data science pipelines and datasets. In this thesis, we propose an alternative theoretical and implementational approach by incorporating sampling techniques and building an end-to-end automation system, Deep Mining. We explore the application of the Bag of Little Bootstraps to the scoring statistics of pipelines, describe substantial asymptotic complexity improvements from its use, and empirically demonstrate its suitability for machine learning applications. The Deep Mining system combines a standardized approach to pipeline composition, a parallelized system for pipeline computation, and clear abstractions for incorporating realistic datasets and methods to provide hyperparameter optimization at scale.
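A rough sketch of the Bag of Little Bootstraps applied to a pipeline's scoring statistic, the sampling technique named in the abstract; the function and parameter names are hypothetical, and the weighting scheme follows the textbook BLB construction rather than Deep Mining's code.

```python
import numpy as np

def blb_score(weighted_score_fn, X, y, gamma=0.6, n_subsamples=5, n_boot=50, seed=0):
    """Bag of Little Bootstraps estimate of a pipeline's score and its spread.

    Each subsample of size n**gamma is resampled with multinomial weights that
    sum to n, so every bootstrap behaves like a full-size dataset while only a
    small slice of the data is ever touched. `weighted_score_fn(X, y, w)` is a
    hypothetical callable that trains and scores the pipeline with sample weights.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    b = max(1, int(n ** gamma))
    subsample_means = []
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=b, replace=False)
        boot_scores = [
            weighted_score_fn(X[idx], y[idx], rng.multinomial(n, np.full(b, 1.0 / b)))
            for _ in range(n_boot)
        ]
        subsample_means.append(np.mean(boot_scores))
    return float(np.mean(subsample_means)), float(np.std(subsample_means))
```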
by Alec W. Anderson.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
49

Tutcher, Jonathan. "Development of semantic data models to support data interoperability in the rail industry." Thesis, University of Birmingham, 2016. http://etheses.bham.ac.uk//id/eprint/6774/.

Full text
Abstract:
Railways are large, complex systems that comprise many heterogeneous subsystems and parts. As the railway industry continues to enjoy increasing passenger and freight custom, ways of deriving greater value from the knowledge within these subsystems are increasingly sought. Interfaces to and between systems are rare, making data sharing and analysis difficult. Semantic data modelling provides a method of integrating data from disparate sources by encoding knowledge about a problem domain or world into machine-interpretable logic and using this knowledge to encode and infer data context and meaning. The uptake of this technique in the Semantic Web and Linked Data movements in recent years has provided a mature set of techniques and toolsets for designing and implementing ontologies and linked data applications. This thesis demonstrates ways in which semantic data models and OWL ontologies can be used to foster data exchange across the railway industry. It sets out a novel methodology for the creation of industrial semantic models, and presents a new set of railway domain ontologies to facilitate integration of infrastructure-centric railway data. Finally, the design and implementation of two prototype systems is described, each of which use the techniques and ontologies in solving a known problem.
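A small rdflib sketch of the kind of railway infrastructure modelling the abstract describes: a subclass axiom plus one asset instance, queried with SPARQL; the namespace and terms are invented for illustration and are not the thesis's ontologies.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

RAIL = Namespace("http://example.org/rail#")  # invented namespace
g = Graph()
g.bind("rail", RAIL)

# A tiny fragment of an infrastructure model: a class axiom plus one asset.
g.add((RAIL.Signal, RDFS.subClassOf, RAIL.TracksideAsset))
g.add((RAIL.signal42, RDF.type, RAIL.Signal))
g.add((RAIL.signal42, RAIL.locatedAt, Literal("123.4 miles")))

# SPARQL over the model; an RDFS reasoner would additionally infer that
# signal42 is a TracksideAsset via the subclass axiom.
query = "SELECT ?asset ?loc WHERE { ?asset a rail:Signal ; rail:locatedAt ?loc }"
for row in g.query(query, initNs={"rail": RAIL}):
    print(row.asset, row.loc)
```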
APA, Harvard, Vancouver, ISO, and other styles
50

Lu, Feng. "Big data scalability for high throughput processing and analysis of vehicle engineering data." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-207084.

Full text
Abstract:
"Sympathy for Data" is a platform that is utilized for Big Data automation analytics. It is based on visual interface and workflow configurations. The main purpose of the platform is to reuse parts of code for structured analysis of vehicle engineering data. However, there are some performance issues on a single machine for processing a large amount of data in Sympathy for Data. There are also disk and CPU IO intensive issues when the data is oversized and the platform need fits comfortably in memory. In addition, for data over the TB or PB level, the Sympathy for data needs separate functionality for efficient processing simultaneously and scalable for distributed computation functionality. This paper focuses on exploring the possibilities and limitations in using the Sympathy for Data platform in various data analytic scenarios within the Volvo Cars vision and strategy. This project re-writes the CDE workflow for over 300 nodes into pure Python script code and make it executable on the Apache Spark and Dask infrastructure. We explore and compare both distributed computing frameworks implemented on Amazon Web Service EC2 used for 4 machine with a 4x type for distributed cluster measurement. However, the benchmark results show that Spark is superior to Dask from performance perspective. Apache Spark and Dask will combine with Sympathy for Data products for a Big Data processing engine to optimize the system disk and CPU IO utilization. There are several challenges when using Spark and Dask to analyze large-scale scientific data on systems. For instance, parallel file systems are shared among all computing machines, in contrast to shared-nothing architectures. Moreover, accessing data stored in commonly used scientific data formats, such as HDF5 is not tentatively supported in Spark. This report presents research carried out on the next generation of Big Data platforms in the automotive industry called "Sympathy for Data". The research questions focusing on improving the I/O performance and scalable distributed function to promote Big Data analytics. During this project, we used the Dask.Array parallelism features for interpretation the data sources as a raster shows in table format, and Apache Spark used as data processing engine for parallelism to load data sources to memory for improving the big data computation capacity. The experiments chapter will demonstrate 640GB of engineering data benchmark for single node and distributed computation mode to evaluate the Sympathy for Data Disk CPU and memory metrics. Finally, the outcome of this project improved the six times performance of the original Sympathy for data by developing a middleware SparkImporter. It is used in Sympathy for Data for distributed computation and connected to the Apache Spark for data processing through the maximum utilization of the system resources. This improves its throughput, scalability, and performance. It also increases the capacity of the Sympathy for data to process Big Data and avoids big data cluster infrastructures.
APA, Harvard, Vancouver, ISO, and other styles
