Dissertations / Theses on the topic 'Large sets'

Listed below are the top 50 dissertations and theses for research on the topic 'Large sets.'


1

Ziegler, Albert. "Large sets in constructive set theory." Thesis, University of Leeds, 2014. http://etheses.whiterose.ac.uk/8370/.

Full text
Abstract:
This thesis presents an investigation into large sets and large set axioms in the context of the constructive set theory CZF. We determine the structure of large sets by classifying their von Neumann stages and use a new modified cumulative hierarchy to characterise their arrangement in the set theoretic universe. We prove that large set axioms have good metamathematical properties, including absoluteness for the common relative model constructions of CZF and a preservation of the witness existence properties CZF enjoys. Furthermore, we use realizability to establish new results about the relative consistency of a plurality of inaccessibles versus the existence of just one inaccessible. Developing a constructive theory of clubs, we present a characterisation theorem for Mahlo sets connecting classical and constructive approaches to Mahloness and determine the amount of induction contained in the assertion of a Mahlo set. We then present a characterisation theorem for 2-strong sets which proves them to be equivalent to a logically simpler concept. We also investigate several topics connected to elementary embeddings of the set theoretic universe into a transitive class model of CZF, where considering different equivalent classical formulations results in a rich and interconnected spectrum of measurability for the constructive case. We pay particular attention to the question of cofinality of elementary embeddings, achieving both very strong cofinality properties in the case of Reinhardt embeddings and constructing models of the failure of cofinality in the case of ordinary measurable embeddings, some of which require only surprisingly low conditions. We close with an investigation of constructive principles incompatible with elementary embeddings.
2

Kleinberg, Robert David. "Online decision problems with large strategy sets." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/33092.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Mathematics, 2005.
Includes bibliographical references (p. 165-171).
In an online decision problem, an algorithm performs a sequence of trials, each of which involves selecting one element from a fixed set of alternatives (the "strategy set") whose costs vary over time. After T trials, the combined cost of the algorithm's choices is compared with that of the single strategy whose combined cost is minimum. Their difference is called regret, and one seeks algorithms which are efficient in that their regret is sublinear in T and polynomial in the problem size. We study an important class of online decision problems called generalized multi-armed bandit problems. In the past such problems have found applications in areas as diverse as statistics, computer science, economic theory, and medical decision-making. Most existing algorithms were efficient only in the case of a small (i.e. polynomial-sized) strategy set. We extend the theory by supplying non-trivial algorithms and lower bounds for cases in which the strategy set is much larger (exponential or infinite) and the cost function class is structured, e.g. by constraining the cost functions to be linear or convex. As applications, we consider adaptive routing in networks, adaptive pricing in electronic markets, and collaborative decision-making by untrusting peers in a dynamic environment.
by Robert David Kleinberg.
Ph.D.
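The regret framework described in the abstract above is easy to make concrete. Below is a minimal, illustrative sketch (not taken from the thesis) of the classic multiplicative-weights (Hedge) strategy on a small strategy set; the learning rate eta and the random cost matrix are assumptions chosen purely for demonstration, while real applications such as adaptive routing would use the structured, large-strategy-set algorithms the thesis develops.

```python
import numpy as np

def hedge(costs, eta=0.1):
    """Multiplicative-weights (Hedge) algorithm for an online decision problem.

    costs: (T, K) array with costs[t, k] in [0, 1], the cost of strategy k at trial t.
    Returns the algorithm's total expected cost and its regret against the best
    single strategy in hindsight.
    """
    T, K = costs.shape
    weights = np.ones(K)
    total_cost = 0.0
    for t in range(T):
        p = weights / weights.sum()          # randomized choice over strategies
        total_cost += p @ costs[t]           # expected cost incurred at this trial
        weights *= np.exp(-eta * costs[t])   # down-weight costly strategies
    best_fixed = costs.sum(axis=0).min()     # best single strategy in hindsight
    return total_cost, total_cost - best_fixed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    algo_cost, regret = hedge(rng.uniform(size=(1000, 5)))
    print(f"cost {algo_cost:.1f}, regret {regret:.1f}")   # regret grows sublinearly in T
```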
3

Villalba, Michael Joseph. "Fast visual recognition of large object sets." Thesis, Massachusetts Institute of Technology, 1990. http://hdl.handle.net/1721.1/42211.

Full text
4

Arvidsson, Johan. "Finding delta difference in large data sets." Thesis, Luleå tekniska universitet, Datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-74943.

Full text
Abstract:
Finding what differs between two versions of a file can be done with several different techniques and programs. These techniques and programs are often focused on finding differences in text files, in documents, or in class files for programming. An example is the popular git tool, which focuses on displaying the difference between versions of files in a project. A common way to find these differences is to use the longest common subsequence (LCS) algorithm, which finds the longest subsequence shared by both files as a measure of their similarity. By excluding all similarities in a file, all remaining text constitutes the differences between the files. The longest common subsequence is often used to find the differences in an acceptable time. When two lines in a file are compared to see whether they differ, hashing is used: the hash values for corresponding lines in both files are compared. Hashing a line gives its content a unique value, so if as little as one character on a line differs between the versions, the hash values for those lines will differ as well. These techniques are very useful when comparing two versions of a file with text content. With data from a database some, but not all, of these techniques are useful; a key difference between data in a database and text in a file is that content is not just added and deleted but also updated. This thesis studies how to make use of these techniques to find differences between large data sets in a reasonable time, instead of finding differences in documents and files. Three different methods are studied in theory, with results given as time and space complexities. Finally, one of these methods is further studied through implementation and testing; only one of the three was implemented because of time constraints. The chosen method offers easy maintainability, an easy implementation, and a good execution time.
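As a rough, hedged illustration of the technique sketched in the abstract, the toy diff below hashes every line and runs the standard longest-common-subsequence dynamic program over the hash sequences to recover added and removed lines. It is a generic example, not the method implemented in the thesis, and it does not handle the updated-row case that distinguishes database deltas from file diffs.

```python
from hashlib import sha1

def line_hashes(lines):
    # Hash each line so comparisons work on fixed-size digests.
    return [sha1(line.encode("utf-8")).hexdigest() for line in lines]

def lcs_table(a, b):
    # dp[i][j] = length of the longest common subsequence of a[i:] and b[j:].
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            dp[i][j] = dp[i + 1][j + 1] + 1 if a[i] == b[j] else max(dp[i + 1][j], dp[i][j + 1])
    return dp

def diff(old_lines, new_lines):
    a, b = line_hashes(old_lines), line_hashes(new_lines)
    dp = lcs_table(a, b)
    i = j = 0
    removed, added = [], []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:                      # common line, not part of the delta
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            removed.append(old_lines[i]); i += 1
        else:
            added.append(new_lines[j]); j += 1
    return removed + old_lines[i:], added + new_lines[j:]

print(diff(["a", "b", "c"], ["a", "c", "d"]))  # (['b'], ['d'])
```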
5

Bate, Steven Mark. "Generalized linear models for large dependent data sets." Thesis, University College London (University of London), 2004. http://discovery.ucl.ac.uk/1446542/.

Full text
Abstract:
Generalized linear models (GLMs) were originally used to build regression models for independent responses. In recent years, however, effort has focused on extending the original GLM theory to enable it to be applied to data which exhibit dependence in the responses. This thesis focuses on some specific extensions of the GLM theory for dependent responses. A new hypothesis testing technique is proposed for the application of GLMs to cluster dependent data. The test is based on an adjustment to the 'independence' likelihood ratio test, which allows for the within cluster dependence. The performance of the new test, in comparison to established techniques, is explored. The application of the generalized estimating equations (GEE) methodology to model space-time data is also investigated. The approach allows for the temporal dependence via the covariates and models the spatial dependence using techniques from geostatistics. The application area of climatology has been used to motivate much of the work undertaken. A key attribute of climate data sets, in addition to exhibiting dependence both spatially and temporally, is that they are typically large in size, often running into millions of observations. Therefore, throughout the thesis, particular attention has focused on computational issues, to enable analysis to be undertaken in a feasible time frame. For example, we investigate the use of the GEE one-step estimator in situations where the application of the full algorithm is impractical. The final chapter of this thesis presents a climate case study. This involves wind speeds over northwestern Europe, which we analyse using the techniques developed.
6

Cordeiro, Robson Leonardo Ferreira. "Data mining in large sets of complex data." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-22112011-083653/.

Full text
Abstract:
Due to the increasing amount and complexity of the data stored in enterprises' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated with handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, special data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated with other data mining tasks that are performed together. Specifically, this doctoral dissertation presents three novel, fast and scalable data mining algorithms well-suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms allow the development of real-time applications that, potentially, could not be developed without this Ph.D. work, such as software to aid the diagnosis process on the fly in a worldwide healthcare information system, or a system to look for deforestation within the Amazon Rainforest in real time.
7

Chaudhary, Amitabh. "Applied spatial data structures for large data sets." Available to US Hopkins community, 2002. http://wwwlib.umi.com/dissertations/dlnow/3068131.

Full text
8

Hennessey, Anthony. "Statistical shape analysis of large molecular data sets." Thesis, University of Nottingham, 2018. http://eprints.nottingham.ac.uk/52088/.

Full text
Abstract:
Protein classification databases are widely used in the prediction of protein structure and function, and amongst these databases the manually-curated Structural Classification of Proteins database (SCOP) is considered to be a gold standard. In SCOP, functional relationships are described by hyperfamily and superfamily categories and structural relationships are described by family, species and protein categories. We present a method to calculate a difference measure between pairs of proteins that can be used to reproduce SCOP2 structural relationship classifications, and that can also be used to reproduce a subset of functional relationship classifications at the superfamily level. Calculating the difference measure requires first finding the best correspondence between atoms in two protein configurations. The problem of finding the best correspondence is known as the unlabelled, partial matching problem. We consider the unlabelled, partial matching problem through a detailed analysis of the approach presented in Green and Mardia (2006). Using this analysis, and applying domain-specific constraints, we develop a new algorithm called GProtA for protein structure alignment. The proposed difference measure is constructed from the root mean squared deviation of the aligned protein structures and a binary similarity measure, where the binary similarity measure takes into account the proportions of atoms matching from each configuration. The GProtA algorithm and difference measure are applied to protein structure data taken from the Protein Data Bank. The difference measure is shown to correctly classify 62 of a set of 72 proteins into the correct SCOP family categories when clustered. Of the remaining 9 proteins, 2 are assigned incorrectly and 7 are considered indeterminate. In addition, a method for deriving characteristic signatures for categories is proposed. The signatures offer a mechanism by which a single comparison can be made to judge similarity to a particular category. Comparison using characteristic signatures is shown to correctly delineate proteins at the family level, including the identification of both families for a subset of proteins described by two family level categories.
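The difference measure described above is built in part from the root mean squared deviation of aligned structures. As a hedged illustration of that single ingredient (the GProtA alignment algorithm and the binary similarity measure are not reproduced here), the sketch below computes the RMSD of two already-matched atom configurations after optimal superposition via the Kabsch algorithm.

```python
import numpy as np

def rmsd_after_superposition(X, Y):
    """RMSD of two matched N x 3 configurations after the optimal
    rigid-body superposition (Kabsch algorithm)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)   # remove translation
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    d = np.sign(np.linalg.det(U @ Vt))                # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt               # optimal rotation
    return np.sqrt(((Xc @ R - Yc) ** 2).sum() / len(X))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    theta = 0.7
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
    Y = X @ Rz.T + np.array([1.0, 2.0, 3.0])          # rotated and translated copy
    print(round(rmsd_after_superposition(X, Y), 6))   # ~0.0
```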
9

Huet, Benoit. "Object recognition from large libraries of line patterns." Thesis, University of York, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.298533.

Full text
10

Thorarinsson, Johann Sigurdur. "ruleViz : visualization of large rule sets and composite events." Thesis, University of Skövde, School of Humanities and Informatics, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-2286.

Full text
Abstract:

Event Condition Action rule engines have been developed for some time now. They can respond automatically to events coming from different sources. The combination of different event types may vary from time to time, and therefore it is hard to determine how the rule engine executes its rules, especially when the engine is given a large rule set to work with. One way to determine the behaviour is to run tests on the rule engine and inspect the final results, but if the results are wrong it can be hard to see what went wrong. ruleViz is a program that can observe the execution and visually animate the rule engine's behaviour by showing connections between rules and composite events, making it easier for the operator to see what causes the fault. ruleViz is designed to embrace Human-Computer Interaction (HCI) methods, making its interface understandable and easy to operate.

11

Dementiev, Roman. "Algorithm engineering for large data sets hardware, software, algorithms." Saarbrücken VDM, Müller, 2006. http://d-nb.info/986494429/04.

Full text
12

Dementiev, Roman. "Algorithm engineering for large data sets : hardware, software, algorithms /." Saarbrücken : VDM-Verl. Dr. Müller, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?id=3029033&prov=M&dok_var=1&dok_ext=htm.

Full text
13

Nair, Sumitra Sarada. "Function estimation using kernel methods for large data sets." Thesis, University of Sheffield, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.444581.

Full text
14

Rauschenberg, David Edward. "Computer-graphical exploration of large data sets from teletraffic." Diss., The University of Arizona, 1994. http://hdl.handle.net/10150/186645.

Full text
Abstract:
The availability of large data sets and powerful computing resources has made data analysis an increasingly viable approach to understanding random processes. Of particular interest are exploratory techniques which provide insight into the local path behavior of highly positively correlated processes. We focus on actual and simulated teletraffic data in the form of time series. Our foremost objective is to develop a methodology of identifying and classifying shape features which are essentially unrecognizable with standard statistical descriptors. Using basic aspects of human vision as a heuristic guide, we have developed an algorithm which "sketches" data sequences. Our approach to summarizing path behavior is based on exploiting the simple structure of a sketch. We have developed a procedure whereby all the "shapes" of a sketch are summarized in a visually comprehensible manner. We do so by placing the shapes in classes, then displaying, for each class, both a representative shape and the number of shapes in the class. These "shape histograms" can provide substantial insight into the behavior of sample paths. We have also used sketches to help model data sequences. The idea here is that a model based on a sketch of a data sequence may provide a better fit under some circumstances than a model based directly on the data. By considering various sketches, one could, for example, develop a Markov chain model whose autocorrelation function approximates that of the original data. We have generalized this use of sketches so that a data sequence can be modeled as the superposition of several sketches, each capturing a different level of detail. Because the concept of path shape is highly visual, it is important that our techniques exploit the strengths of and accommodate for the weaknesses of human vision. We have addressed this by using computer graphics in a variety of novel ways.
15

Farran, Bassam. "One-pass algorithms for large and shifting data sets." Thesis, University of Southampton, 2010. https://eprints.soton.ac.uk/159173/.

Full text
Abstract:
For many problem domains, practitioners are faced with the problem of ever-increasing amounts of data. Examples include the UniProt database of proteins which now contains ~6 million sequences, and the KDD ’99 data which consists of ~5 million points. At these scales, the state-of-the-art machine learning techniques are not applicable since the multiple passes they require through the data are prohibitively expensive, and a need for different approaches arises. Another issue arising in real-world tasks, which is only recently becoming a topic of interest in the machine learning community, is distribution shift, which occurs naturally in many problem domains such as intrusion detection and EEG signal mapping in the Brain-Computer Interface domain. This means that the i.i.d. assumption between the training and test data does not hold, causing classifiers to perform poorly on the unseen test set. We first present a novel, hierarchical, one-pass clustering technique that is capable of handling very large data. Our experiments show that the quality of the clusters generated by our method does not degrade, while making vast computational savings compared to algorithms that require multiple passes through the data. We then propose Voted Spheres, a novel, non-linear, one-pass, multi-class classification technique capable of handling millions of points in minutes. Our empirical study shows that it achieves state-of-the-art performance on real world data sets, in a fraction of the time required by other methods. We then adapt the VS to deal with covariate shift between the training and test phases using two different techniques: an importance weighting scheme and kernel mean matching. Our results on a toy problem and the real-world KDD ’99 data show an increase in performance to our VS framework. Our final contribution involves applying the one-pass VS algorithm, along with the adapted counterpart (for covariate shift), to the Brain-Computer Interface domain, in which linear batch algorithms are generally used. Our VS-based methods outperform the SVM, and perform very competitively with the submissions of a recent BCI competition, which further shows the robustness of our proposed techniques to different problem domains.
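One of the covariate-shift corrections mentioned above is importance weighting. The sketch below shows one common, generic way to estimate such weights with a domain classifier (the density-ratio trick); it is an assumed stand-in for illustration and is not the specific weighting scheme or the kernel mean matching procedure used in the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_train, X_test):
    """Estimate w(x) = p_test(x) / p_train(x) with a train-vs-test classifier.

    The resulting weights can multiply the training loss so that a model
    trained on shifted data better matches the test distribution.
    """
    X = np.vstack([X_train, X_test])
    domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = LogisticRegression(max_iter=1000).fit(X, domain)
    p_test_given_x = clf.predict_proba(X_train)[:, 1]
    odds = p_test_given_x / (1.0 - p_test_given_x)
    return odds * (len(X_train) / len(X_test))   # correct for pool-size imbalance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(loc=0.0, size=(2000, 3))
    X_test = rng.normal(loc=0.5, size=(2000, 3))   # shifted covariates
    w = importance_weights(X_train, X_test)
    print(round(float(w.mean()), 2))               # roughly 1 on average
```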
16

Mangalvedkar, Pallavi Ramachandra. "GPU-ASSISTED RENDERING OF LARGE TREE-SHAPED DATA SETS." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1195491112.

Full text
17

Yeh, Jieh-Shan George. "Large sets of disjoint t-(v,k,λ) designs." The Ohio State University, 1999. http://rave.ohiolink.edu/etdc/view?acc_num=osu1488193272069635.

Full text
18

Anderson, James D. "Interactive Visualization of Search Results of Large Document Sets." Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1547048073451373.

Full text
19

Toulis, Panagiotis. "Implicit methods for iterative estimation with large data sets." Thesis, Harvard University, 2016. http://nrs.harvard.edu/urn-3:HUL.InstRepos:33493434.

Full text
Abstract:
The ideal estimation method needs to fulfill three requirements: (i) efficient computation, (ii) statistical efficiency, and (iii) numerical stability. The classical stochastic approximation of (Robbins, 1951) is an iterative estimation method, where the current iterate (parameter estimate) is updated according to some discrepancy between what is observed and what is expected assuming the current iterate has the true parameter value. Classical stochastic approximation undoubtedly meets the computation requirement, which explains its widespread popularity, for example, in modern applications of machine learning with large data sets, but cannot effectively combine it with efficiency and stability. Surprisingly, the stability issue can be improved substantially, if the aforementioned discrepancy is computed not using the current iterate, but using the conditional expectation of the next iterate given the current one. The computational overhead of the resulting implicit update is minimal for many statistical models, whereas statistical efficiency can be achieved through simple averaging of the iterates, as in classical stochastic approximation (Ruppert, 1988). Thus, implicit stochastic approximation is fast and principled, fulfills requirements (i-iii) for a number of popular statistical models including generalized linear models, M-estimation, and proportional hazards, and it is poised to become the workhorse of estimation with large data sets in statistical practice.
Statistics
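For the squared-error loss the implicit update described in the abstract has a closed form, which makes the idea easy to sketch. The toy implementation below (an illustrative, assumption-laden example, not the thesis code) runs implicit stochastic gradient descent with iterate averaging on simulated linear regression data.

```python
import numpy as np

def implicit_sgd_linear(X, y, lr0=1.0):
    """Implicit SGD for least squares with Ruppert-style iterate averaging.

    The implicit update theta_n = theta_{n-1} + a_n * (y_n - x_n' theta_n) * x_n
    is solved for theta_n; for the squared loss this gives the shrunken step
    below, which keeps the iterates numerically stable even for large a_n.
    """
    n, p = X.shape
    theta = np.zeros(p)
    avg = np.zeros(p)
    for i in range(n):
        a = lr0 / (1.0 + i)                          # decaying learning rate
        x = X[i]
        resid = y[i] - x @ theta
        theta = theta + (a / (1.0 + a * (x @ x))) * resid * x
        avg += (theta - avg) / (i + 1)               # running average of iterates
    return avg

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    X = rng.normal(size=(100_000, 5))
    y = X @ true_theta + rng.normal(size=100_000)
    print(np.round(implicit_sgd_linear(X, y), 2))    # close to true_theta
```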
20

Posid, Tasha Irene. "The small-large divide: The development of infant abilities to discriminate small from large sets." Thesis, Boston College, 2015. http://hdl.handle.net/2345/bc-ir:104371.

Full text
Abstract:
Thesis advisor: Sara Cordes
Thesis advisor: Ellen Winner
Evidence suggests that humans and non-human animals have access to two distinct numerical representation systems: a precise "object-file" system used to visually track small quantities (<4) and an approximate, ratio-dependent analog magnitude system used to represent all natural numbers. Although many studies to date indicate that infants can discriminate exclusively small sets (e.g., 1 vs. 2, 2 vs. 3) or exclusively large sets (4 vs. 8, 8 vs. 16), a robust phenomenon exists whereby they fail to compare sets crossing this small-large boundary (2 vs. 4, 3 vs. 6) despite a seemingly favorable ratio of difference between the two set sizes. Despite these robust failures in infancy (up to 14 months), studies suggest that 3-year-old children no longer encounter difficulties discriminating small from large sets, yet little work has explored the development of this phenomenon between 14 months and 3 years of age. The present study investigates (1) when in development infants naturally overcome this inability to compare small vs. large sets, as well as (2) what factors may facilitate this ability: namely, perceptual variability and/or numerical language. Results from three cross-sectional studies indicate that infants begin to discriminate between small and large sets as early as 17 months of age. Furthermore, infants seemed to benefit from perceptual variability of the items in the set when making these discriminations. Moreover, although preliminary evidence suggests that a child's ability to verbally count may correlate with success on these discriminations, simple exposure to numerical language (in the form of adults modeling labeling the cardinality and counting the set) does not affect performance.
Thesis (PhD) — Boston College, 2015
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Psychology
21

Ljung, Patric. "Efficient Methods for Direct Volume Rendering of Large Data Sets." Doctoral thesis, Norrköping : Department of Science and Technology, Linköping University, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-7232.

Full text
22

Lam, Heidi Lap Mun. "Visual exploratory analysis of large data sets : evaluation and application." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/839.

Full text
Abstract:
Large data sets are difficult to analyze. Visualization has been proposed to assist exploratory data analysis (EDA) as our visual systems can process signals in parallel to quickly detect patterns. Nonetheless, designing an effective visual analytic tool remains a challenge. This challenge is partly due to our incomplete understanding of how common visualization techniques are used by human operators during analyses, either in laboratory settings or in the workplace. This thesis aims to further understand how visualizations can be used to support EDA. More specifically, we studied techniques that display multiple levels of visual information resolutions (VIRs) for analyses using a range of methods. The first study is a summary synthesis conducted to obtain a snapshot of knowledge in multiple-VIR use and to identify research questions for the thesis: (1) low-VIR use and creation; (2) spatial arrangements of VIRs. The next two studies are laboratory studies to investigate the visual memory cost of image transformations frequently used to create low-VIR displays and overview use with single-level data displayed in multiple-VIR interfaces. For a more well-rounded evaluation, we needed to study these techniques in ecologically-valid settings. We therefore selected the application domain of web session log analysis and applied our knowledge from our first three evaluations to build a tool called Session Viewer. Taking the multiple coordinated view and overview + detail approaches, Session Viewer displays multiple levels of web session log data and multiple views of session populations to facilitate data analysis from the high-level statistical to the low-level detailed session analysis approaches. Our fourth and last study for this thesis is a field evaluation conducted at Google Inc. with seven session analysts using Session Viewer to analyze their own data with their own tasks. Study observations suggested that displaying web session logs at multiple levels using the overview + detail technique helped bridge between high-level statistical and low-level detailed session analyses, and the simultaneous display of multiple session populations at all data levels using multiple views allowed quick comparisons between session populations. We also identified design and deployment considerations to meet the needs of diverse data sources and analysis styles.
23

Tricker, Edward A. "Detecting anomalous aggregations of data points in large data sets." Thesis, Imperial College London, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.512050.

Full text
24

Uminsky, David. "Generalized Spectral Analysis for Large Sets of Approval Voting Data." Scholarship @ Claremont, 2003. https://scholarship.claremont.edu/hmc_theses/157.

Full text
Abstract:
Generalized Spectral analysis of approval voting data uses representation theory and the symmetry of the data to project the approval voting data into orthogonal and interpretable subspaces. Unfortunately, as the number of voters grows, the data space becomes prohibitively large to compute the decomposition of the data vector. To attack these large data sets we develop a method to partition the data set into equivalence classes, in order to drastically reduce the size of the space while retaining the necessary characteristics of the data set. We also make progress on the needed statistical tools to explain the results of the spectral analysis. The standard spectral analysis will be demonstrated, and our partitioning technique is applied to U.S. Senate roll call data.
25

Reuter, Patrick. "Reconstruction and Rendering of Implicit Surfaces from Large Unorganized Point Sets." Phd thesis, Université Sciences et Technologies - Bordeaux I, 2003. http://tel.archives-ouvertes.fr/tel-00576950.

Full text
Abstract:
Recent 3-D acquisition technologies provide large numbers of unorganized three-dimensional points. It is important to reconstruct a continuous surface from these unorganized points and to visualize it. In this document, we present new methods for reconstructing implicit surfaces from large unorganized point sets. These methods use variational surfaces reconstructed locally from radial basis functions, which are combined with one another by a partition-of-unity mechanism. In order to obtain an interactive visualization of the generated surfaces, we also present rendering techniques that use not only the reconstructed implicit surface but also the initial point set. A first point-based rendering technique adapts automatically to the viewer's position and the size of the viewport, thanks to a hierarchical multiresolution structure, and a second point-based rendering technique uses the local differential geometry at each point. Finally, a number of actual or potential applications of the preceding techniques are presented, such as the interactive construction of solid textures from unorganized points, the reconstruction of terrain elevation from contour lines, and the restoration of damaged photographs.
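As a small illustration of the reconstruction idea only (local variational fitting with radial basis functions; the partition-of-unity blending and the point-based rendering techniques are omitted), the sketch below fits a Gaussian RBF implicit function to oriented points using off-surface constraints. The kernel width sigma and the normal offset eps are illustrative assumptions.

```python
import numpy as np

def fit_rbf_implicit(points, normals, eps=0.05, sigma=0.3):
    """Fit an implicit function f with f = 0 on the input points.

    Off-surface constraints f(p + eps * n) = eps along the normals prevent
    the trivial solution f == 0; a Gaussian kernel keeps the system positive
    definite so no polynomial term is needed in this toy version.
    """
    centers = np.vstack([points, points + eps * normals])
    values = np.concatenate([np.zeros(len(points)), eps * np.ones(len(points))])
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    w = np.linalg.solve(A + 1e-9 * np.eye(len(A)), values)

    def f(x):
        d2x = ((np.asarray(x)[None, :] - centers) ** 2).sum(-1)
        return float(np.exp(-d2x / (2.0 * sigma ** 2)) @ w)
    return f

if __name__ == "__main__":
    t = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
    circle = np.c_[np.cos(t), np.sin(t)]      # points on the unit circle
    f = fit_rbf_implicit(circle, circle)      # a circle's outward normals equal the points
    print(round(f([1.0, 0.0]), 3), round(f([1.05, 0.0]), 3))  # ~0.0 on the curve, ~eps just outside
```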
26

Lundell, Fredrik. "Out-of-Core Multi-Resolution Volume Rendering of Large Data Sets." Thesis, Linköpings universitet, Medie- och Informationsteknik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-70162.

Full text
Abstract:
A modality device can today capture high-resolution volumetric data sets, and as data resolutions increase, so do the challenges of processing volumetric data through a visualization pipeline. Standard volume rendering pipelines often use a graphics processing unit (GPU) to accelerate rendering performance by taking advantage of the parallel architecture of such devices. Unfortunately, graphics cards have limited amounts of video memory (VRAM), causing a bottleneck in a standard pipeline. Multi-resolution techniques can be used to efficiently modify the rendering pipeline, allowing a sub-domain within the volume to be represented at different resolutions. The active resolution distribution is temporarily stored in the VRAM for rendering and the inactive parts are stored on secondary memory layers such as the system RAM or on disk. The active resolution set can be optimized to produce high-quality renders while minimizing the amount of storage required. This is done by using a dynamic compression scheme which optimizes the visual quality by evaluating user-input data. The optimized resolution of each sub-domain is then, on demand, streamed to the VRAM from secondary memory layers. Rendering a multi-resolution data set requires some extra care at the boundaries between sub-domains. To avoid artifacts, an intrablock interpolation (II) sampling scheme capable of creating smooth transitions between sub-domains at arbitrary resolutions can be used. The result is a highly optimized rendering pipeline, complemented by a preprocessing pipeline, together capable of rendering large volumetric data sets in real time.
27

Månsson, Per. "Database analysis and managing large data sets in a trading environment." Thesis, Linköpings universitet, Databas och informationsteknik, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-104193.

Full text
Abstract:
Start-up companies today tend to need to scale up quickly and smoothly, to cover rapidly increasing demand for the services they create. It is also always necessary to save money and to find a cost-efficient solution which can meet the demands of the company. This report uses Amazon Web Services for infrastructure: databases hosted on Elastic Compute Cloud and on the Relational Database Service, as well as Amazon DynamoDB for NoSQL storage, are compared, benchmarked and evaluated.
28

Jungic, Veselin. "Elementary, topological, and experimental approaches to the family of large sets." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape9/PQDD_0027/NQ51878.pdf.

Full text
29

Carter, Caleb. "High Resolution Visualization of Large Scientific Data Sets Using Tiled Display." Fogler Library, University of Maine, 2007. http://www.library.umaine.edu/theses/pdf/CarterC2007.pdf.

Full text
30

Memarsadeghi, Nargess. "Efficient algorithms for clustering and interpolation of large spatial data sets." College Park, Md. : University of Maryland, 2007. http://hdl.handle.net/1903/6839.

Full text
Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2007.
Thesis research directed by: Computer Science. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
31

Sips, Mike. "Pixel-based visual data mining in large geo-spatial point sets /." Konstanz : Hartung-Gorre, 2006. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=014881714&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

Full text
32

Coudret, Raphaël. "Stochastic modelling using large data sets : applications in ecology and genetics." Phd thesis, Université Sciences et Technologies - Bordeaux I, 2013. http://tel.archives-ouvertes.fr/tel-00865867.

Full text
Abstract:
There are two main parts in this thesis. The first one concerns valvometry, which is here the study of the distance between the two parts of the shell of an oyster over time. The health status of oysters can be characterized using valvometry in order to obtain insights about the quality of their environment. We consider that a renewal process with four states underlies the behaviour of the studied oysters. Such a hidden process can be retrieved from a valvometric signal by assuming that some probability density function linked with this signal is bimodal. We then compare several estimators which take this assumption into account, including kernel density estimators. In another chapter, we compare several regression approaches aimed at analysing transcriptomic data. To understand which explanatory variables have an effect on gene expression, we apply a multiple testing procedure to these data through the linear model FAMT. The SIR method may find nonlinear relations in such a context, but it is more commonly used when the response variable is univariate. A multivariate version of SIR was therefore developed. Procedures to measure gene expression can be expensive, so the sample size n of the corresponding datasets is often small. That is why we also studied SIR when n is less than the number of explanatory variables p.
33

Winter, Eitan E. "Evolutionary analyses of protein-coding genes using large biological data sets." Thesis, University of Oxford, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.427615.

Full text
34

Mostafa, Nour. "Intelligent dynamic caching for large data sets in a grid environment." Thesis, Queen's University Belfast, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.602689.

Full text
Abstract:
Present and future distributed applications need to deal with very large, petabyte (PB) datasets and increasing numbers of associated users and resources. The emergence of Grid-based systems as a potential solution for large computational and data management problems has initiated significant research activity in the area. Grid research can be divided into at least two areas: Data Grids and Computational Grids. The aim of Data Grids is to provide services for accessing, sharing and modifying large databases; the aim of Computational Grids is to provide services for sharing resources. The considerable increase in data production and data sharing within scientific communities has created the need for improvements in data access and data availability. It can be argued that the problems associated with the management of very large datasets are not well served by current approaches. The thesis concentrates on one of these areas: access to distributed, very large databases on Grid resources. To this end, it presents the design and implementation of a partial replication system and a Grid caching system that mediates access to distributed data. Artificial intelligence (AI) techniques such as a neural network (NN) have been used as the prediction element of the model to determine user requirements by analysing the past history of the user. Hence, this thesis examines the problems surrounding the manipulation of very large data sets within a Grid-like environment. The goal is the development of a prototype system that will enable both effective and efficient access to very large datasets, based on the use of a caching model.
35

Kim, Hyeyoen. "Large data sets and nonlinearity : essays in international finance and macroeconomics." Thesis, University of Warwick, 2009. http://wrap.warwick.ac.uk/3747/.

Full text
Abstract:
This thesis investigates whether the information in large macroeconomic data sets is relevant for resolving some of the puzzling and questionable aspects of international finance and macroeconomics. In particular, we employ diffusion index (DI) analysis in order to condense very large data sets into a small number of factors. Incorporating the factors into conventional model specifications addresses the following main issues. Using factor-augmented vector autoregressive (FAVAR) models, we measure the impact of UK and US monetary policy. This approach notably mitigates the 'price puzzle' for both economies, whereby a monetary tightening appears to have perverse effects on price movements. We also estimate structural FAVARs and examine the impact of aggregate demand and aggregate supply using a recursive long-run multiplier identification procedure. This method is applied to examine the evidence for increased UK macroeconomic flexibility following the UK labour market reforms of the 1980s. For forecasting purposes, factors are employed as 'unobserved' fundamentals which direct the movement of exchange rates. From the long-run relationship between factor-based fundamentals and the exchange rate, the deviation of the exchange rate from its fundamental level is exploited to improve the predictive performance of fundamentals-based exchange rate models. Our empirical results provide strong evidence that factors help to predict exchange rates as the horizon lengthens, better than the random walk and the standard monetary fundamentals models. Finally, we explore whether allowing for a wide range of influences on the real exchange rate in a nonlinear framework can help to resolve the 'PPP puzzle'. Factors, as determinants of the time-varying equilibrium of real exchange rates, are incorporated into a nonlinear framework. Allowing for the effects of macroeconomic factors dramatically increases the measured speed of adjustment of the real exchange rate.
36

Mellor, John Phillip 1965. "Automatically recovering geometry and texture from large sets of calibrated images." Thesis, Massachusetts Institute of Technology, 1999. http://hdl.handle.net/1721.1/87157.

Full text
Abstract:
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2000.
Includes bibliographical references (p. 129-135).
by J.P. Mellor.
Ph.D.
37

Deri, Joya A. "Graph Signal Processing: Structure and Scalability to Massive Data Sets." Research Showcase @ CMU, 2016. http://repository.cmu.edu/dissertations/725.

Full text
Abstract:
Large-scale networks are becoming more prevalent, with applications in healthcare systems, financial networks, social networks, and traffic systems. The detection of normal and abnormal behaviors (signals) in these systems presents a challenging problem. State-of-the-art approaches such as principal component analysis and graph signal processing address this problem using signal projections onto a space determined by an eigendecomposition or singular value decomposition. When a graph is directed, however, applying methods based on the graph Laplacian or singular value decomposition causes information from unidirectional edges to be lost. Here we present a novel formulation and graph signal processing framework that addresses this issue and that is well suited for application to extremely large, directed, sparse networks. In this thesis, we develop and demonstrate a graph Fourier transform for which the spectral components are the Jordan subspaces of the adjacency matrix. In addition to admitting a generalized Parseval’s identity, this transform yields graph equivalence classes that can simplify the computation of the graph Fourier transform over certain networks. Exploration of these equivalence classes provides the intuition for an inexact graph Fourier transform method that dramatically reduces computation time over real-world networks with nontrivial Jordan subspaces. We apply our inexact method to four years of New York City taxi trajectories (61 GB after preprocessing) over the NYC road network (6,400 nodes, 14,000 directed edges). We discuss optimization strategies that reduce the computation time of taxi trajectories from raw data by orders of magnitude: from 3,000 days to less than one day. Our method yields a fine-grained analysis that pinpoints the same locations as the original method while reducing computation time and decreasing energy dispersal among spectral components. This capability to rapidly reduce raw traffic data to meaningful features has important ramifications for city planning and emergency vehicle routing.
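When the adjacency matrix happens to be diagonalizable, the graph Fourier transform described above reduces to a change of basis into the eigenvectors of A; the sketch below shows that special case on a directed cycle, where it coincides with the classical DFT. The Jordan-subspace construction and the inexact method that handle defective matrices in the thesis are not reproduced here.

```python
import numpy as np

def graph_fourier_transform(A, s):
    """GFT of a signal s over a directed graph with adjacency matrix A,
    assuming A is diagonalizable (the defective case needs Jordan subspaces)."""
    eigvals, V = np.linalg.eig(A)           # columns of V span the spectral components
    s_hat = np.linalg.solve(V, s)           # coordinates of s in that basis
    return eigvals, s_hat

if __name__ == "__main__":
    # Adjacency matrix of a directed 4-cycle: node i points to node (i + 1) mod 4.
    A = np.roll(np.eye(4), 1, axis=1)
    s = np.array([1.0, 2.0, 3.0, 4.0])
    lam, s_hat = graph_fourier_transform(A, s)
    print(np.round(lam, 3))                 # the 4th roots of unity
    print(np.round(np.abs(s_hat), 3))       # spectral magnitudes of s
```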
38

Towfeek, Ajden. "Multi-Resolution Volume Rendering of Large Medical Data Sets on the GPU." Thesis, Linköping University, Department of Science and Technology, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10715.

Full text
Abstract:

Volume rendering techniques can be powerful tools when visualizing medical data sets. The ability to capture 3-D internal structures makes the technique attractive. Scanning equipment produces medical images with rapidly increasing resolution, resulting in a heavily increased size of the data sets. Despite the great amount of processing power CPUs deliver, the required precision in image quality can be hard to obtain in real-time rendering. Therefore, it is highly desirable to optimize the rendering process.

Modern GPUs possess much more computational power and are available for general-purpose programming through high-level shading languages. Efficient representations of the data are crucial due to the limited memory provided by the GPU. This thesis describes the theoretical background and the implementation of an approach presented by Patric Ljung, Claes Lundström and Anders Ynnerman at Linköping University. The main objective is to implement a fully working multi-resolution framework with two separate pipelines for pre-processing and real-time rendering, which uses the GPU to visualize large medical data sets.

39

González, David Muñoz. "Discovering unknown equations that describe large data sets using genetic programming techniques." Thesis, Linköping University, Department of Electrical Engineering, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2639.

Full text
Abstract:

FIR filters are widely used nowadays, with applications ranging from MP3 players, Hi-Fi systems and digital TVs to communication systems such as wireless communication. They are implemented in DSPs, and there are several trade-offs that make it important to have as exact an estimate as possible of the required filter order.

In order to find a better estimate of the filter order than the existing ones, gene expression programming (GEP) is used. GEP is a genetic algorithm that can be used for function finding. It is implemented in a commercial application which, after the appropriate input file and settings have been provided, evolves the individuals in the input file until a good solution is found. This thesis is the first in this new line of research.

The aim has been not only to reach the desired estimate but also to pave the way for further investigations.

40

Bäckström, Daniel. "Managing and Exploring Large Data Sets Generated by Liquid Separation - Mass Spectrometry." Doctoral thesis, Uppsala University, Analytical Chemistry, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-8223.

Full text
Abstract:

A trend in natural science and especially in analytical chemistry is the increasing need for analysis of a large number of complex samples with low analyte concentrations. Biological samples (urine, blood, plasma, cerebral spinal fluid, tissue etc.) are often suitable for analysis with liquid separation mass spectrometry (LS-MS), resulting in two-way data tables (time vs. m/z). Such biological 'fingerprints' taken for all samples in a study correspond to a large amount of data. Detailed characterization requires a high sampling rate in combination with high mass resolution and wide mass range, which presents a challenge in data handling and exploration. This thesis describes methods for managing and exploring large data sets made up of such detailed 'fingerprints' (represented as data matrices).

The methods were implemented as scripts and functions in Matlab, a widespread environment for matrix manipulations. A single-file structure to hold the imported data facilitated both easy access and fast manipulation. Routines for baseline removal and noise reduction were intended to reduce the amount of data without losing relevant information. A tool for visualizing and exploring single runs was also included. When comparing two or more 'fingerprints', they usually have to be aligned due to unintended shifts in analyte positions in time and m/z. A PCA-like multivariate method proved to be less sensitive to such shifts, and an ANOVA implementation made it easier to find systematic differences within the data sets.

The above strategies and methods were applied to complex samples such as plasma, protein digests, and urine. The fields of application included urine profiling (paracetamol intake; beverage effects), peptide mapping (different digestion protocols) and the search for potential biomarkers (appendicitis diagnosis). The influence of the experimental factors was visualized by PCA score plots as well as clustering diagrams (dendrograms).

41

Cutchin, Andrew E. Donahoo Michael J. "Towards efficient and practical reliable bulk data transport for large receiver sets." Waco, Tex. : Baylor University, 2007. http://hdl.handle.net/2104/5140.

Full text
42

Harrington, Justin. "Extending linear grouping analysis and robust estimators for very large data sets." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/845.

Full text
Abstract:
Cluster analysis is the study of how to partition data into homogeneous subsets so that the partitioned data share some common characteristic. In one to three dimensions, the human eye can distinguish well between clusters of data if clearly separated. However, when there are more than three dimensions and/or the data is not clearly separated, an algorithm is required which needs a metric of similarity that quantitatively measures the characteristic of interest. Linear Grouping Analysis (LGA, Van Aelst et al. 2006) is an algorithm for clustering data around hyperplanes, and is most appropriate when: 1) the variables are related/correlated, which results in clusters with an approximately linear structure; and 2) it is not natural to assume that one variable is a “response”, and the remainder the “explanatories”. LGA measures the compactness within each cluster via the sum of squared orthogonal distances to hyperplanes formed from the data. In this dissertation, we extend the scope of problems to which LGA can be applied. The first extension relates to the linearity requirement inherent within LGA, and proposes a new method of non-linearly transforming the data into a Feature Space, using the Kernel Trick, such that in this space the data might then form linear clusters. A possible side effect of this transformation is that the dimension of the transformed space is significantly larger than the number of observations in a given cluster, which causes problems with orthogonal regression. Therefore, we also introduce a new method for calculating the distance of an observation to a cluster when its covariance matrix is rank deficient. The second extension concerns the combinatorial problem for optimizing a LGA objective function, and adapts an existing algorithm, called BIRCH, for use in providing fast, approximate solutions, particularly for the case when data does not fit in memory. We also provide solutions based on BIRCH for two other challenging optimization problems in the field of robust statistics, and demonstrate, via simulation study as well as application on actual data sets, that the BIRCH solution compares favourably to the existing state-of-the-art alternatives, and in many cases finds a more optimal solution.
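The within-cluster compactness that LGA measures can be illustrated with a few lines of linear algebra: fit a hyperplane to each cluster by total least squares and sum the squared orthogonal distances. The sketch below is a generic illustration of that building block only, not the LGA search itself nor the kernelized or BIRCH-based extensions developed in the dissertation.

```python
import numpy as np

def orthogonal_distances(points):
    """Fit a hyperplane through `points` by total least squares and return
    each point's orthogonal distance to it."""
    center = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - center, full_matrices=False)
    normal = Vt[-1]                                   # direction of least variance
    return np.abs((points - center) @ normal)

def lga_style_objective(points, labels):
    """Sum of squared orthogonal distances, accumulated over clusters."""
    return sum(float((orthogonal_distances(points[labels == k]) ** 2).sum())
               for k in np.unique(labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=(200, 2))
    cluster0 = np.c_[x[:, 0], 2.0 * x[:, 0] + 0.05 * x[:, 1]]    # points near y = 2x
    cluster1 = np.c_[x[:, 0], -x[:, 0] + 3.0 + 0.05 * x[:, 1]]   # points near y = -x + 3
    pts = np.vstack([cluster0, cluster1])
    labels = np.repeat([0, 1], 200)
    print(round(lga_style_objective(pts, labels), 3))            # small: both clusters are nearly linear
```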
43

Kışınbay, Turgut. "Predictive ability or data snooping? : essays on forecasting with large data sets." Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85018.

Full text
Abstract:
This thesis examines the predictive ability of models for forecasting inflation and financial market volatility. Emphasis is put on the evaluation of forecasts and the use of large data sets. A variety of models is used to forecast inflation, including diffusion indices, artificial neural networks, and traditional linear regressions. Financial market volatility is forecast using various GARCH-type and high-frequency based models. High-frequency data are also used to obtain ex-post estimates of volatility, which are then used to evaluate the forecasts. All forecasts are evaluated using recently proposed techniques that can account for data-snooping bias and for nested and nonlinear models.
44

Dutta, Soumya. "In Situ Summarization and Visual Exploration of Large-scale Simulation Data Sets." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1524070976058567.

Full text
45

Blanc, Trevor Jon. "Analysis and Compression of Large CFD Data Sets Using Proper Orthogonal Decomposition." BYU ScholarsArchive, 2014. https://scholarsarchive.byu.edu/etd/5303.

Full text
Abstract:
Efficient analysis and storage of data are integral but often challenging tasks when working with computational fluid dynamics, mainly due to the amount of data it can output. Methods centered around the proper orthogonal decomposition were used to analyze, compress, and model various simulation cases. Two different high-fidelity, time-accurate turbomachinery simulations were investigated to show various applications of the analysis techniques. The first turbomachinery example was used to illustrate the extraction of turbulent coherent structures such as traversing shocks, vortex shedding, and wake variation from deswirler and rotor blade passages. Using only the most dominant modes, flow fields were reconstructed and analyzed for error. The reconstructions reproduced the general dynamics within the flow well, but failed to fully resolve shock fronts and smaller vortices. By decomposing the domain into smaller, independent pieces, reconstruction error was reduced by up to 63 percent. A new method of data compression that combined an image compression algorithm and the proper orthogonal decomposition was used to store the reconstructions of the flow field, increasing data compression ratios by a factor of 40. The second turbomachinery simulation studied was a three-stage fan with inlet total pressure distortion. Both the snapshot and repeating geometry methods were used to characterize structures of static pressure fluctuation within the blade passages of the third rotor blade row. Modal coefficients filtered by frequencies relating to the inlet distortion pattern were used to produce reconstructions of the pressure field solely dependent on the inlet boundary condition. A hybrid proper orthogonal decomposition method was proposed to limit burdens on computational resources while providing high temporal resolution analysis. Parametric reduced order models were created from large databases of transient and steady conjugate heat transfer and airfoil simulations. Performance of the models was found to depend heavily on the range of the parameters varied as well as the number of simulations used to traverse that range. The heat transfer models gave excellent predictions for temperature profiles in heated solids for ambitious parameter ranges. Model development for the airfoil case showed that accuracy was highly dependent on modal truncation. The flow fields were predicted very well, especially outside the boundary layer region of the flow.
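The snapshot form of the proper orthogonal decomposition used above amounts to an SVD of a mean-centred snapshot matrix followed by truncation to the most energetic modes. The sketch below is a minimal, generic illustration of that step (the field, mode count and error measure are assumptions), not the hybrid POD or the coupling with image compression developed in the thesis.

```python
import numpy as np

def pod_compress(snapshots, r):
    """POD of a snapshot matrix (n_dof x n_snapshots), truncated to r modes.

    Returns the spatial basis, the modal (temporal) coefficients and the
    relative reconstruction error of the rank-r approximation.
    """
    mean = snapshots.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(snapshots - mean, full_matrices=False)
    basis = U[:, :r]                        # dominant spatial modes
    coeffs = np.diag(s[:r]) @ Vt[:r]        # temporal coefficients
    recon = mean + basis @ coeffs
    err = np.linalg.norm(recon - snapshots) / np.linalg.norm(snapshots)
    return basis, coeffs, err

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 5000)[:, None]
    t = np.linspace(0.0, 1.0, 200)[None, :]
    field = np.sin(2 * np.pi * x) * np.cos(4 * np.pi * t) \
          + 0.1 * np.cos(6 * np.pi * x) * np.sin(2 * np.pi * t)
    _, _, err = pod_compress(field, r=2)
    print(f"relative error with 2 modes: {err:.1e}")   # essentially zero for a rank-2 field
```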
APA, Harvard, Vancouver, ISO, and other styles
46

Nguyen, Minh Quoc. "Toward accurate and efficient outlier detection in high dimensional and large data sets." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/34657.

Full text
Abstract:
An efficient method to compute local density-based outliers in high-dimensional data is proposed. In our work, we show that this type of outlier is present in any subset of the data set. This property is used to partition the data set into random subsets and compute the outliers locally; the outliers from the different subsets are then combined, so local density-based outliers can be computed efficiently. Another challenge in outlier detection in high-dimensional data is that outliers are often suppressed when the majority of dimensions do not exhibit outlying behavior. The contribution of this work is a filtering method whereby outlier scores are computed in sub-dimensions: low sub-dimensional scores are filtered out and high scores are aggregated into the final score. This aggregation with filtering eliminates the effect of accumulating small deviations across multiple dimensions, so the outliers are identified correctly. In some cases, sets of outliers that form micro-patterns are more interesting than individual outliers; these micro-patterns are anomalous with respect to the dominant patterns in the data set. In anomalous pattern detection there are two challenges. The first is that anomalous patterns are often overlooked in favor of the dominant patterns when existing clustering techniques are used; a common approach is to cluster the data set using the k-nearest-neighbor algorithm. The contribution of this work is to introduce the adaptive nearest neighbor and the concept of the dual neighbor to detect micro-patterns more accurately. The second challenge is to compute the anomalous patterns very quickly. Our contribution is to compute the patterns based on the correlation between the attributes: the correlation implies that the data can be partitioned into groups based on each attribute, and the candidate patterns can be learned within the groups. A feature-based method is thus developed that can compute these patterns efficiently.
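A minimal sketch of the partition-then-combine idea described above: the data set is split into random subsets, scores are computed locally on each subset, and the per-point scores are merged. A simple k-nearest-neighbour distance score stands in for the local density-based outlier score; all names, the choice of score, and the toy data are assumptions for illustration only.

```python
import numpy as np

def knn_distance_scores(X, k=10):
    """Simple local density proxy: mean distance to the k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # ignore self-distances
    knn = np.sort(d, axis=1)[:, :k]
    return knn.mean(axis=1)

def partitioned_outlier_scores(X, n_parts=4, k=10, seed=0):
    """Score outliers locally on random partitions, then combine.

    Mirrors the idea of splitting the data set into random subsets,
    scoring each subset independently, and merging the per-point scores.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    scores = np.empty(len(X))
    for part in np.array_split(idx, n_parts):
        scores[part] = knn_distance_scores(X[part], k=k)
    return scores

# Toy example: a Gaussian cloud plus a handful of far-away points.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(1000, 5)),
               rng.normal(8, 1, size=(5, 5))])
scores = partitioned_outlier_scores(X)
print("top-5 outlier indices:", np.argsort(scores)[-5:])
```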
APA, Harvard, Vancouver, ISO, and other styles
47

Bresell, Anders. "Characterization of protein families, sequence patterns, and functional annotations in large data sets." Doctoral thesis, Linköping : Department of Physics, Chemistry and Biology, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10565.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Castro, Jose R. "MODIFICATIONS TO THE FUZZY-ARTMAP ALGORITHM FOR DISTRIBUTED LEARNING IN LARGE DATA SETS." Doctoral diss., University of Central Florida, 2004. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4449.

Full text
Abstract:
The Fuzzy-ARTMAP (FAM) algorithm is one of the premier neural network architectures for classification problems. FAM can learn online and is usually faster than other neural network approaches. Nevertheless, training FAM can slow down considerably when the size of the training set grows into the hundreds of thousands. In this dissertation we apply data partitioning and network partitioning to the FAM algorithm in sequential and parallel settings to achieve better convergence time and to train efficiently with large databases (hundreds of thousands of patterns). We implement our parallelization on a Beowulf cluster of workstations, a choice of platform that requires the parallelization to be coarse-grained. Extensive testing of all the approaches is done on three large data sets (half a million data points): the Forest Covertype database from Blackard and two artificially generated Gaussian data sets with different percentages of overlap between classes. Speedups in the data partitioning approach reached the order of hundreds without having to invest in parallel computation, while speedups in the network partitioning approach are close to linear on a cluster of workstations. Both methods allowed us to reduce the time needed to train the neural network on large databases from days to minutes. We prove formally that the workload balance of our network partitioning approaches will never be worse than an acceptable bound, and we also demonstrate the correctness of these parallel variants of FAM.
Ph.D.
School of Electrical and Computer Engineering
Engineering and Computer Science
Electrical and Computer Engineering
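As a rough illustration of the data-partitioning strategy described in the abstract above, the sketch below trains one classifier per random partition of the training set and combines them by majority vote. A trivial nearest-prototype learner stands in for Fuzzy-ARTMAP, the partitions are processed sequentially rather than on separate cluster nodes, and all names and data are illustrative assumptions.

```python
import numpy as np

def train_prototype_classifier(X, y):
    """Stand-in learner: one prototype (class mean) per class.

    Fuzzy-ARTMAP is not reproduced here; any online learner could be
    trained on a partition in its place.
    """
    classes = np.unique(y)
    protos = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, protos

def predict(model, X):
    """Assign each point to the class of its nearest prototype."""
    classes, protos = model
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=-1)
    return classes[np.argmin(d, axis=1)]

def data_partitioned_training(X, y, n_parts=8, seed=0):
    """Split the training set into random partitions and train one classifier
    per partition; on a cluster each partition would go to a separate node,
    which is where the coarse-grained speedup comes from."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    return [train_prototype_classifier(X[p], y[p])
            for p in np.array_split(idx, n_parts)]

def ensemble_predict(models, X):
    """Combine the partition classifiers by majority vote."""
    votes = np.stack([predict(m, X) for m in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy example: two overlapping Gaussian classes, as in the artificial data sets.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (5000, 10)), rng.normal(1.5, 1, (5000, 10))])
y = np.repeat([0, 1], 5000)
models = data_partitioned_training(X, y)
acc = np.mean(ensemble_predict(models, X) == y)
print(f"training-set accuracy of the partitioned ensemble: {acc:.3f}")
```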
APA, Harvard, Vancouver, ISO, and other styles
49

Brind'Amour, Katherine. "Maternal and Child Health Home Visiting Evaluations Using Large, Pre-Existing Data Sets." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1468965739.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Quiroz, Matias. "Bayesian Inference in Large Data Problems." Doctoral thesis, Stockholms universitet, Statistiska institutionen, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-118836.

Full text
Abstract:
In the last decade or so, there has been a dramatic increase in storage capacity and in the ability to process huge amounts of data. This has made large, high-quality data sets widely accessible to practitioners. This technological innovation seriously challenges traditional modeling and inference methodology. This thesis is devoted to developing inference and modeling tools for handling large data sets. Four included papers treat various important aspects of this topic, with a special emphasis on Bayesian inference by scalable Markov chain Monte Carlo (MCMC) methods. In the first paper, we propose a novel mixture-of-experts model for longitudinal data. The model and inference methodology allow for manageable computations with a large number of subjects, and the model dramatically improves the out-of-sample predictive density forecasts compared to existing models. The second paper aims at developing a scalable MCMC algorithm: ideas from the survey sampling literature are used to estimate the likelihood on a random subset of the data, the likelihood estimate is used within the pseudo-marginal MCMC framework, and we develop a theoretical framework for such subset-based algorithms. The third paper further develops the ideas introduced in the second paper: we introduce the difference estimator in this framework and modify the methods for estimating the likelihood on a random subset of the data, resulting in scalable inference for a wider class of models. Finally, the fourth paper brings the survey sampling tools for estimating the likelihood developed in the thesis into the delayed-acceptance MCMC framework; we compare to an existing approach in the literature and document promising results for our algorithm.

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 1: Submitted. Paper 2: Submitted. Paper 3: Manuscript. Paper 4: Manuscript.
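As a rough sketch of the subset-based likelihood estimation underlying the second and third papers, the code below scales the log-likelihood of a simple random subsample up to the full data set and plugs that estimate into a random-walk Metropolis sampler. The Gaussian model, the simple-random-sampling estimator, and the omission of any bias-correction machinery are illustrative simplifications, not the estimators actually developed in the thesis.

```python
import numpy as np

def subset_loglik_estimate(data, theta, m, rng):
    """Estimate the full-data log-likelihood from a simple random subsample
    of size m, scaled by n/m so the estimate is unbiased for the full sum.

    The per-observation log-density is a Gaussian with mean `theta`,
    standing in for whatever model the sampler actually targets.
    """
    n = len(data)
    idx = rng.choice(n, size=m, replace=False)
    ll_subset = -0.5 * (data[idx] - theta) ** 2 - 0.5 * np.log(2 * np.pi)
    return (n / m) * ll_subset.sum()

def subsampling_metropolis(data, n_iter=2000, m=500, step=0.05, seed=0):
    """Random-walk Metropolis using the subsampled log-likelihood estimate
    in place of the exact one (a pseudo-marginal-style sketch)."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    loglik = subset_loglik_estimate(data, theta, m, rng)
    draws = []
    for _ in range(n_iter):
        prop = theta + step * rng.normal()
        loglik_prop = subset_loglik_estimate(data, prop, m, rng)
        if np.log(rng.uniform()) < loglik_prop - loglik:
            theta, loglik = prop, loglik_prop
        draws.append(theta)
    return np.array(draws)

# Toy example: 100,000 observations, posterior for the mean explored
# while evaluating only 500 observations per iteration.
rng = np.random.default_rng(4)
data = rng.normal(1.0, 1.0, size=100_000)
draws = subsampling_metropolis(data)
print(f"posterior mean estimate after burn-in: {draws[1000:].mean():.3f}")
```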

APA, Harvard, Vancouver, ISO, and other styles
