
Dissertations / Theses on the topic 'Datasety'


Consult the top 50 dissertations / theses for your research on the topic 'Datasety.'


You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.


1

Zembjaková, Martina. "Prieskum a taxonómia sieťových forenzných nástrojov." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2021. http://www.nusl.cz/ntk/nusl-445488.

Full text
Abstract:
This master's thesis deals with a survey and a taxonomy of network forensic tools. It describes the basics of network forensics, including the process models, techniques, and data sources used in forensic analysis. The thesis then surveys existing taxonomies of network forensic tools, including a comparison of them, followed by a survey of the tools themselves; besides the tools mentioned in the taxonomy survey, some additional network tools are also discussed. Next, the datasets that serve as input for analysis by the individual network tools are described and compared in detail. Based on the information gathered in these surveys, common use cases are proposed, and the tools are demonstrated as part of the description of each use case. In addition to publicly available datasets, newly created datasets, described in detail in a dedicated chapter, are used to demonstrate the tools. Finally, a new taxonomy is proposed: unlike existing taxonomies based on NFAT and NSM tools, user interface, data capture, analysis, or type of forensic analysis, it is organised around tool use cases.
2

Kratochvíla, Lukáš. "Trasování objektu v reálném čase." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-403748.

Full text
Abstract:
Tracking a generic object in real time on a device with limited resources is difficult. Many algorithms addressing this problem already exist, and this thesis reviews them. Various approaches, including deep learning, are discussed. Object representations, datasets, and evaluation metrics are presented. Many tracking algorithms are introduced; eight of them are implemented and evaluated on the VOT dataset.
3

Singh, Manjeet. "A Comparison of Rule Extraction Techniques with Emphasis on Heuristics for Imbalanced Datasets." Ohio University / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1282139633.

Full text
4

Silva, Jesús, Palma Hugo Hernández, Núñez William Niebles, David Ovallos-Gazabon, and Noel Varela. "Parallel Algorithm for Reduction of Data Processing Time in Big Data." Institute of Physics Publishing, 2020. http://hdl.handle.net/10757/652134.

Full text
Abstract:
Technological advances have made it possible to collect and store large volumes of data over the years. It is therefore important that today's applications perform well and can analyze these large datasets effectively. It remains a challenge for data mining to keep its algorithms and applications efficient as data size and dimensionality increase [1]. To achieve this goal, many applications rely on parallelism, which reduces the cost associated with the execution time of the algorithms by taking advantage of the characteristics of current computer architectures to run several processes concurrently [2]. This paper proposes a parallel version of the FuzzyPred algorithm based on the amount of data that can be processed within each of the processing threads, synchronously and independently.
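The abstract stops short of FuzzyPred's internals, but the partitioning scheme it describes is easy to illustrate. Below is a minimal Python sketch, with a toy stand-in predicate in place of the real fuzzy-predicate evaluation: the data is split into chunks that workers evaluate independently, and the partial results are merged.

# A minimal sketch of data-partitioning parallelism, assuming a toy
# membership predicate; FuzzyPred's actual predicate evaluation is not shown.
from concurrent.futures import ProcessPoolExecutor

def evaluate_chunk(chunk):
    # Stand-in for evaluating a fuzzy predicate over a block of records.
    return [min(1.0, x / 100.0) for x in chunk]

def parallel_evaluate(data, n_workers=4):
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(evaluate_chunk, chunks)  # independent workers
    return [score for part in results for score in part]

if __name__ == "__main__":
    print(parallel_evaluate(list(range(1000)))[:5])

Process-based workers are used here because predicate evaluation is CPU-bound; a thread pool would serialize on the interpreter lock in CPython.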
5

Munyombwe, Theresa. "The harmonisation of stroke datasets : a case study of four UK datasets." Thesis, University of Leeds, 2016. http://etheses.whiterose.ac.uk/13511/.

Full text
Abstract:
Longitudinal studies of stroke patients play a critical part in developing stroke prognostic models. Stroke longitudinal studies are often limited by small sample sizes, poor recruitment, and high attrition levels. Some of these limitations can be addressed by harmonising and pooling data from existing studies. Thus this thesis evaluated the feasibility of harmonising and pooling secondary stroke datasets to investigate the factors associated with disability after stroke. Data from the Clinical Information Management System for Stroke study (n=312), Stroke Outcome Study 1 (n=448), Stroke Outcome Study 2 (n=585), and the Leeds Sentinel Stroke National Audit (n=350) were used in this research. The research conducted in this thesis consisted of four stages. The first stage used the Data Schema and Harmonisation Platform for Epidemiological Research (DataSHaPER) approach to evaluate the feasibility of harmonising and pooling the four datasets that were used in this case study. The second stage evaluated the utility of using multi-group confirmatory factor analysis for testing measurement invariance of the GHQ-28 measure prior to pooling the datasets. The third stage evaluated the utility of using Item Response Theory (IRT) models and regression-based methods for linking disability outcome measures. The last stage synthesised the harmonised datasets using multi-group latent class analysis and multi-level Poisson models to investigate the factors associated with disability post-stroke. The main barrier encountered in pooling the four datasets was the heterogeneity in outcome measures. Pooling datasets was beneficial but there was a trade-off between increasing the sample size and losing important covariates. The findings from this present study suggested that the GHQ-28 measure was invariant across the SOS1 and SOS2 stroke cohorts, thus an integrative data analysis of the two SOS datasets was conducted. Harmonising measurement scales using IRT models and regression-based methods was effective for predicting group averages and not individual patient predictions. The analyses of harmonised datasets suggested an association of female gender with anxiety and depressive symptoms post-stroke. This research concludes that harmonising and pooling data from multiple stroke studies was beneficial but there were challenges in measurement comparability. Continued efforts should be made to develop a Data Schema for stroke to facilitate data sharing in stroke rehabilitation research.
6

Furman, Yoel Avraham. "Forecasting with large datasets." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:69f2833b-cc53-457a-8426-37c06df85bc2.

Full text
Abstract:
This thesis analyzes estimation methods and testing procedures for handling large data series. The first chapter introduces the use of the adaptive elastic net, and the penalized regression methods nested within it, for estimating sparse vector autoregressions. That chapter shows that under suitable conditions on the data generating process this estimation method satisfies an oracle property. Furthermore, it is shown that the bootstrap can be used to accurately conduct inference on the estimated parameters. These properties are used to show that structural VAR analysis can also be validly conducted, allowing for accurate measures of policy response. The strength of these estimation methods is demonstrated in a numerical study and on U.S. macroeconomic data. The second chapter continues in a similar vein, using the elastic net to estimate sparse vector autoregressions of realized variances to construct volatility forecasts. It is shown that the use of volatility spillovers estimated by the elastic net delivers substantial improvements in forecast ability, and can be used to indicate systemic risk among a group of assets. The model is estimated on realized variances of equities of U.S. financial institutions, where it is shown that the estimated parameters translate into two novel indicators of systemic risk. The third chapter discusses the use of the bootstrap as an alternative to asymptotic Wald-type tests. It is shown that the bootstrap is particularly useful in situations with many restrictions, such as tests of equal conditional predictive ability that make use of many orthogonal variables, or 'test functions'. The testing procedure is analyzed in a Monte Carlo study and is used to test the relevance of real variables in forecasting U.S. inflation.
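As a rough illustration of the estimation strategy this abstract describes, the sketch below fits a sparse VAR equation-by-equation with scikit-learn's plain elastic net on synthetic data; the adaptive weighting, bootstrap inference, and structural analysis of the thesis are not reproduced.

# A minimal sketch: sparse VAR(p) estimation via one elastic-net regression
# per variable on a lagged design matrix. Data is synthetic with a known
# sparse VAR(1) structure.
import numpy as np
from sklearn.linear_model import ElasticNet

def lagged_design(Y, p):
    # Stack [y_{t-1}, ..., y_{t-p}] as regressors for y_t.
    T = Y.shape[0]
    X = np.hstack([Y[p - l - 1:T - l - 1] for l in range(p)])
    return X, Y[p:]

rng = np.random.default_rng(0)
A = np.diag([0.6, 0.5, 0.4, 0.0, 0.0])         # true sparse VAR(1) matrix
Y = np.zeros((200, 5))
for t in range(1, 200):
    Y[t] = Y[t - 1] @ A.T + rng.standard_normal(5)

X, Z = lagged_design(Y, p=2)
# One elastic-net regression per variable yields one sparse coefficient row.
coefs = np.vstack([
    ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000).fit(X, Z[:, j]).coef_
    for j in range(Z.shape[1])
])
print(coefs.round(2))                          # zeros mark excluded lags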
7

Mumtaz, Shahzad. "Visualisation of bioinformatics datasets." Thesis, Aston University, 2015. http://publications.aston.ac.uk/25261/.

Full text
Abstract:
Analysing the molecular polymorphism and interactions of DNA, RNA and proteins is of fundamental importance in biology. Predicting functions of polymorphic molecules is important in order to design more effective medicines. Analysing major histocompatibility complex (MHC) polymorphism is important for mate choice, epitope-based vaccine design and transplantation rejection. Most of the existing exploratory approaches cannot analyse these datasets because of the large number of molecules with a high number of descriptors per molecule. This thesis develops novel methods for data projection in order to explore high-dimensional biological datasets by visualising them in a low-dimensional space. With increasing dimensionality, some existing data visualisation methods such as generative topographic mapping (GTM) become computationally intractable. We propose variants of these methods, where we use log-transformations at certain steps of the expectation maximisation (EM) based parameter learning process, to make them tractable for high-dimensional datasets. We demonstrate these proposed variants on both synthetic data and an electrostatic potential dataset of MHC class-I. We also propose to extend a latent trait model (LTM), suitable for visualising high dimensional discrete data, to simultaneously estimate feature saliency as an integrated part of the parameter learning process of a visualisation model. This LTM variant not only gives better visualisation by modifying the projection map based on feature relevance, but also helps users to assess the significance of each feature. Another problem which is not addressed much in the literature is the visualisation of mixed-type data. We propose to combine GTM and LTM in a principled way, where appropriate noise models are used for each type of data, in order to visualise mixed-type data in a single plot. We call this model a generalised GTM (GGTM). We also propose to extend the GGTM model to estimate feature saliencies while training a visualisation model; this is called GGTM with feature saliency (GGTM-FS). We demonstrate the effectiveness of these proposed models both for synthetic and real datasets. We evaluate visualisation quality using quality metrics such as a distance distortion measure and rank-based measures: trustworthiness, continuity, and mean relative rank errors with respect to data space and latent space. In cases where the labels are known, we also use the quality metrics of KL divergence and nearest-neighbour classification error in order to determine the separation between classes. We demonstrate the efficacy of these proposed models both for synthetic and real biological datasets, with a main focus on the MHC class-I dataset.
8

Mazumdar, Suvodeep. "Visualising large semantic datasets." Thesis, University of Sheffield, 2013. http://etheses.whiterose.ac.uk/5932/.

Full text
Abstract:
This thesis aims at addressing a major issue in the Semantic Web and organisational Knowledge Management: consuming large-scale semantic data in a generic, scalable and pleasing manner. It proposes two solutions by de-constructing the issue into two sub-problems: how can large semantic result sets be presented to users, and how can large semantic datasets be explored and queried. The first proposed solution is a dashboard-based multi-visualisation approach to present simultaneous views over different facets of the data. Challenges imposed by existing technology infrastructure resulted in the development of a set of design guidelines. These guidelines and the lessons learnt from the development of the approach are the first contribution of this thesis. The next stage of research began with the formulation of design principles derived from the literature on aesthetic design, Visual Analytics and the Semantic Web. These principles provide guidelines to developers for building generic visualisation solutions for large-scale semantic data and constitute the next contribution of the thesis. The second proposed solution is an interactive node-link visualisation approach that presents semantic concepts and their relations enriched with statistics of the underlying data. This solution was developed with explicit attention to the proposed design principles. The two solutions exploit basic rules and templates to translate low-level user interactions into high-level intents, and subsequently into formal queries in a generic manner. These translation rules and templates, which enable generic exploration of large-scale semantic data, constitute the third contribution of the thesis. An iterative User-Centered Design methodology, with the active participation of nearly a hundred users, including knowledge workers, managers, engineers, researchers and students, was employed over the duration of the research to develop both solutions. The fourth contribution of this thesis is an argument for the continued active participation and involvement of all user communities to ensure the development of a highly effective, intuitive and appreciated solution.
9

De León, Eduardo Enrique. "Medical abstract inference dataset." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/119516.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (page 35).
In this thesis, I built a dataset for predicting clinical outcomes from medical abstracts and their titles. Medical Abstract Inference consists of 1,794 data points. Titles were filtered to include the abstract's reported medical intervention and clinical outcome. Data points were annotated with the intervention's effect on the outcome. The resulting labels were one of the following: increased, decreased, or had no significant difference on the outcome. In addition, rationale sentences were marked; these sentences supply the necessary supporting evidence for the overall prediction. Preliminary modeling was also done to evaluate the corpus. Preliminary models included top-performing Natural Language Inference models as well as rationale-based models and linear classifiers.
by Eduardo Enrique de León.
M. Eng.
10

Schöner, Holger. "Working with real world datasets : preprocessing and prediction with large incomplete and heterogeneous datasets." [S.l.] : [s.n.], 2005. http://deposit.ddb.de/cgi-bin/dokserv?idn=973424672.

Full text
11

Gemulla, Rainer. "Sampling Algorithms for Evolving Datasets." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2008. http://nbn-resolving.de/urn:nbn:de:bsz:14-ds-1224861856184-11644.

Full text
Abstract:
Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system's point of view. More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases with data being distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing.
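The thesis's maintenance algorithms are not reproduced here, but classic reservoir sampling illustrates the insertion-only special case they generalize: a provably uniform sample of a stream maintained without any access to the base data. A minimal Python sketch:

# Classic reservoir sampling: after n items, each item is in the sample with
# probability k/n. The thesis's algorithms additionally handle updates and
# deletions (e.g., on multisets), which this sketch does not.
import random

def reservoir_sample(stream, k, seed=42):
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir
        else:
            j = rng.randrange(n)         # uniform in [0, n)
            if j < k:                    # replace a slot with prob. k/n
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=10))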
12

Schmidt, Heiko A. "Phylogenetic trees from large datasets." [S.l. : s.n.], 2003. http://deposit.ddb.de/cgi-bin/dokserv?idn=968534945.

Full text
13

Jones, Martin. "Multigene datasets for deep phylogeny." Thesis, University of Edinburgh, 2007. http://hdl.handle.net/1842/2575.

Full text
Abstract:
Though molecular phylogenetics has been very successful in reconstructing the evolutionary history of species, some phylogenies, particularly those involving ancient events, have proven difficult to resolve. One approach to improving the resolution of deep phylogenies is to increase the amount of data by including multiple genes assembled from public sequence databases. Using modern phylogenetic methods and abundant computing power, the vast amount of sequence data available in public databases can be brought to bear on difficult phylogenetic problems. In this thesis I outline the motivation for assembling large multigene datasets and lay out the obstacles associated with doing so. I discuss the various methods by which these obstacles can be overcome and describe a bioinformatics solution, TaxMan, that can be used to rapidly assemble very large datasets of aligned genes in a largely automated fashion. I also explain the design and features of TaxMan from a biological standpoint and present the results of benchmarking studies. I illustrate the use of TaxMan to assemble large multigene datasets for two groups of taxa – the subphylum Chelicerata and the superphylum Lophotrochozoa. Chelicerata is a diverse group of arthropods with an uncertain phylogeny. When a set of mitochondrial genes is used to analyse the relationships between the chelicerate orders, the conclusions are highly dependent upon the evolutionary model used and are affected by the presence of systematic compositional bias in mitochondrial genomes. Lophotrochozoa is a recently-proposed group of protostome phyla. A number of distinct phylogenetic hypotheses concerning the relationships between lophotrochozoan phyla have been proposed. I compare the phylogenetic conclusions given by analysis of nuclear and mitochondrial protein-coding and rRNA genes to evaluate support for some of these hypotheses.
14

Traore, Michael. "Interactive visualization for volumetric datasets." Thesis, Toulouse, ISAE, 2018. http://www.theses.fr/2018ESAE0028.

Full text
Abstract:
L’occlusion est un problème dans la visualisation volumétrique car elle empêche lavisualisation directe d’une région d’intérêt. Alors que la plupart des systèmes existantsutilisent une combinaison de techniques de rendu en volume direct (DVR) et deleur fonction de transfert (TF) correspondante, nous avons envisagé des techniquesd’interaction alternatives pour explorer ces ensembles de données.Tout d’abord, nous avons proposé un nouveau système de visualisation interactivepour les bagages numérisés en 3D, accéléré par les techniques GPGPU, conformémentaux besoins que nous avons extraits de l’enquête contextuelle auprès desagents de sécurité de l’aéroport.Deuxièmement, nous avons proposé une nouvelle technique qui associe unrendu volumétrique de haute qualité à une lentille rapide, polyvalente et facile àutiliser pour prendre en charge l’exploration interactive des données occluses dansdes volumes
Occlusion is an issue in volumetric visualization as it prevents direct visualizationof the region of interest. While most existing systems use a combination of DirectVolume Rendering (DVR) technique and its corresponding Transfer Function (TF),we considered alternative interaction techniques to explore such datasets.First, we proposed a new interactive visualization system for 3D scanned baggageaccelerated with GPGPU techniques in accordance with the needs we extractedfrom the contextual inquiry with the airport security agents.Secondly, we proposed a novel technique which combines high-quality DVRwith a fast, versatile, and easy to use, lens to support the interactive explorationof occluded data in volumes
15

Giritharan, Balathasan. "Incremental Learning with Large Datasets." Thesis, University of North Texas, 2012. https://digital.library.unt.edu/ark:/67531/metadc149595/.

Full text
Abstract:
This dissertation focuses on a novel learning strategy based on geometric support vector machines to address the difficulties of processing immense data sets. Support vector machines find the hyper-plane that maximizes the margin between two classes, and because the decision boundary is represented with only a few training samples they are a favorable choice for incremental learning. The dissertation presents a novel method, Geometric Incremental Support Vector Machines (GISVM), to address both efficiency and accuracy issues in handling massive data sets. In GISVM, the skin of convex hulls is defined and an efficient method is designed to find the best skin approximation given available examples. The set of extreme points is found by recursively searching along the direction defined by a pair of known extreme points. By identifying the skin of the convex hulls, the incremental learning will only employ a much smaller number of samples with comparable or even better accuracy. When additional samples are provided, they are used together with the skin of the convex hull constructed from the previous dataset. This results in a small number of instances used in the incremental steps of the training process. Based on the experimental results with synthetic data sets, public benchmark data sets from UCI, and endoscopy videos, it is evident that GISVM achieved satisfactory classifiers that closely model the underlying data distribution. GISVM improves sensitivity in the incremental steps, significantly reduces the demand for memory space, and demonstrates the ability to recover from temporary performance degradation.
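The recursive extreme-point search behind GISVM is not reproduced here, but the general flavor of the approach can be sketched: retain only a boundary representation of the data seen so far (below, the previous model's support vectors stand in crudely for the convex-hull "skin") and retrain on it together with each new batch. A hypothetical illustration using scikit-learn:

# A sketch of boundary-only incremental SVM training, assuming synthetic
# linearly separable data; support vectors approximate the "skin".
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_batch(n=500):
    X = rng.standard_normal((n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels
    return X, y

X, y = make_batch()
model = SVC(kernel="linear").fit(X, y)

for _ in range(3):                            # incremental steps
    Xb, yb = make_batch()
    keep = model.support_                     # indices of boundary samples
    X = np.vstack([X[keep], Xb])
    y = np.concatenate([y[keep], yb])
    model = SVC(kernel="linear").fit(X, y)    # far fewer samples than the union

print(len(model.support_), "support vectors after incremental training")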
16

Barnathan, Michael. "Mining Complex High-Order Datasets." Diss., Temple University Libraries, 2010. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/82058.

Full text
Abstract:
Computer and Information Science
Ph.D.
Selection of an appropriate structure for storage and analysis of complex datasets is a vital but often overlooked decision in the design of data mining and machine learning experiments. Most present techniques impose a matrix structure on the dataset, with rows representing observations and columns representing features. While this assumption is reasonable when features are scalar and do not exhibit co-dependence, the matrix data model becomes inappropriate when dependencies between non-target features must be modeled in parallel, or when features naturally take the form of higher-order multilinear structures. Such datasets particularly abound in functional medical imaging modalities, such as fMRI, where accurate integration of both spatial and temporal information is critical. Although necessary to take full advantage of the high-order structure of these datasets and built on well-studied mathematical tools, tensor analysis methodologies have only recently entered widespread use in the data mining community and remain relatively absent from the literature within the biomedical domain. Furthermore, naive tensor approaches suffer from fundamental efficiency problems which limit their practical use in large-scale high-order mining and do not capture local neighborhoods necessary for accurate spatiotemporal analysis. To address these issues, a comprehensive framework based on wavelet analysis, tensor decomposition, and the WaveCluster algorithm is proposed for addressing the problems of preprocessing, classification, clustering, compression, feature extraction, and latent concept discovery on large-scale high-order datasets, with a particular emphasis on applications in computer-assisted diagnosis. Our framework is evaluated on a 9.3 GB fMRI motor task dataset of both high dimensionality and high order, performing favorably against traditional voxelwise and spectral methods of analysis, discovering latent concepts suggestive of subject handedness, and reducing space and time complexities by up to two orders of magnitude. Novel wavelet and tensor tools are derived in the course of this work, including a novel formulation of an r-dimensional wavelet transform in terms of elementary tensor operations and an enhanced WaveCluster algorithm capable of clustering real-valued as well as binary data. Sparseness-exploiting properties are demonstrated and variations of core algorithms for specialized tasks such as image segmentation are presented.
Temple University--Theses
17

Alawini, Abdussalam. "Identifying Relationships between Scientific Datasets." Thesis, Portland State University, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10127966.

Full text
Abstract:

Scientific datasets associated with a research project can proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation history. For instance, if dataset A is contained in dataset B, then the connection between A and B could be that A was extended to create B.

We present a relationship-identification methodology as a solution to this problem. To examine the feasibility of our approach, we articulated a set of relevant relationships, developed algorithms for efficient discovery of these relationships, and organized these algorithms into a new system called ReConnect to assist scientists in relationship discovery. We also evaluated existing alternative approaches that rely on flagging differences between two spreadsheets and found that they were impractical for many relationship-discovery tasks. Additionally, we conducted a user study, which showed that relationships do occur in real-world spreadsheets, and that ReConnect can improve scientists' ability to detect such relationships between datasets.

The promising results of ReConnect's evaluation encouraged us to explore a more automated approach for relationship discovery. In this dissertation, we introduce an automated end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related, and the relationship between them. Our experimental results demonstrate the overall effectiveness of ReDiscover in predicting relationships in a scientist's or a small group of researchers' collections of datasets, and the sensitivity of the overall system to the performance of its various components.
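As an illustration of the kind of relationship test ReConnect automates, the sketch below checks whether dataset A is contained in dataset B, which would suggest that B was created by extending A. The system's full battery of tests and its efficiency machinery are not reproduced; the data here is made up.

# A containment test between two tabular datasets using pandas.
import pandas as pd

def is_contained(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    if not set(a.columns) <= set(b.columns):
        return False
    # Left-merge A against B's rows; A is contained if every row matches.
    merged = a.merge(b[list(a.columns)].drop_duplicates(), how="left",
                     indicator=True)
    return (merged["_merge"] == "both").all()

A = pd.DataFrame({"site": ["s1", "s2"], "ph": [6.8, 7.1]})
B = pd.DataFrame({"site": ["s1", "s2", "s3"], "ph": [6.8, 7.1, 6.5]})
print(is_contained(A, B))  # True: B plausibly extends A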

18

Jayaraman, Jayakumar. "Dental age assessment of Southern Chinese using Demirjian's dataset and the United Kingdom dataset." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2010. http://hub.hku.hk/bib/B45447767.

Full text
19

Horečný, Peter. "Metody segmentace obrazu s malými trénovacími množinami." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2020. http://www.nusl.cz/ntk/nusl-412996.

Full text
Abstract:
The goal of this thesis was to propose an image segmentation method capable of effective segmentation with small datasets. The recently published ODE neural network was used for this method, because its properties should provide better generalization on tasks where only small datasets are available. The proposed ODE-UNet network was created by combining the UNet architecture with an ODE neural network, taking advantage of both. ODE-UNet reached the following results on the ISBI dataset: Rand: 0.950272 and Info: 0.978061. These results are better than those obtained with the UNet model, which was also tested in this thesis, but it was shown that the state of the art cannot be outperformed using ODE neural networks. However, the advantages of the ODE neural network over the tested UNet architecture and other methods were confirmed, and there is still room for improvement by extending this method.
20

Koufakou, Anna. "SCALABLE AND EFFICIENT OUTLIER DETECTION IN LARGE DISTRIBUTED DATA SETS WITH MIXED-TYPE ATTRIBUTES." Doctoral diss., University of Central Florida, 2009. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/3431.

Full text
Abstract:
An important problem that appears often when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted by the outliers themselves. Outlier Detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of useful applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis of deciding if a data point is an outlier is often some measure or notion of dissimilarity between the data point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data, and compute pair-wise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, thus precluding distance computations. Additionally, these methods must be applicable to data which might be distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF), and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state-of-the-art, while achieving similar accuracy detection rates.
Ph.D.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Engineering PhD
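The AVF idea named in the abstract above is simple enough to sketch: score each categorical record by the average frequency of its attribute values, and flag low-scoring (rare-combination) records as outliers, with no pairwise distance computations. A minimal sketch on toy data:

# Attribute Value Frequency (AVF) scoring for categorical records:
# one pass to count value frequencies, one pass to score.
from collections import Counter

def avf_scores(rows):
    m = len(rows[0])
    counts = [Counter(r[j] for r in rows) for j in range(m)]  # per attribute
    return [sum(counts[j][r[j]] for j in range(m)) / m for r in rows]

data = [("red", "suv"), ("red", "suv"), ("red", "sedan"), ("blue", "truck")]
scores = avf_scores(data)
print(min(zip(scores, data)))  # lowest score = most outlying record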
21

Sysoev, Oleg. "Monotonic regression for large multivariate datasets." Linköping : Department of Computer and Information Science, Linköping University, 2010. http://www2.bibl.liu.se/liupubl/disp/disp2010/stat11s.pdf.

Full text
22

Mahmood, Muhammad Habib. "Motion annotation in complex video datasets." Doctoral thesis, Universitat de Girona, 2018. http://hdl.handle.net/10803/667583.

Full text
Abstract:
Motion segmentation refers to the process of separating regions and trajectories from a video sequence into coherent subsets of space and time. In this thesis, we created a new multifaceted motion segmentation dataset comprising real-life long and short sequences, with different numbers of motions and frames per sequence, and real distortions with missing data. Trajectory- and region-based ground-truth is provided on all the frames of all the sequences. We also proposed a new semi-automatic tool for delineating the trajectories in complex videos, even in videos captured from moving cameras. With a minimal manual annotation of an object mask, the algorithm is able to propagate the label mask in all the frames. Object label correction based on static and moving occluders is performed by applying occluder mask tracking for a given depth ordering. The results show that our cascaded-naive approach provides successful results in a variety of video sequences.
23

Shi, Xiaojin. "Visual learning from small training datasets /." Diss., Digital Dissertations Database. Restricted to UC campuses, 2005. http://uclibs.org/PID/11984.

Full text
24

Cotter, Andrew. "Regression on datasets containing missing elements." Diss., Connect to online resource, 2005. http://wwwlib.umi.com/cr/colorado/fullcit?p1425786.

Full text
25

Zhang, Xiaoyu. "Scalable isocontour visualization for large datasets /." Full text (PDF) from UMI/Dissertation Abstracts International, 2001. http://wwwlib.umi.com/cr/utexas/fullcit?p3064695.

Full text
26

Yang, Chaozheng. "Sufficient Dimension Reduction in Complex Datasets." Diss., Temple University Libraries, 2016. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/404627.

Full text
Abstract:
Statistics
Ph.D.
This dissertation focuses on two problems in dimension reduction. One is using a permutation approach to test predictor contribution. The permutation approach applies to marginal coordinate tests based on dimension reduction methods such as SIR, SAVE and DR. This approach no longer requires calculation of the method-specific weights to determine the asymptotic null distribution. The other is combining a clustering method with robust regression (least absolute deviation) to estimate the dimension reduction subspace. Compared with ordinary least squares, the proposed method is more robust to outliers; it also replaces the global linearity assumption with the more flexible local linearity assumption through k-means clustering.
Temple University--Theses
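The permutation mechanics the abstract refers to can be illustrated generically: permute one predictor, recompute a dependence statistic, and compare against the observed value, so no method-specific asymptotic null is needed. In the sketch below a plain R^2 statistic stands in for the SIR/SAVE/DR-based marginal coordinate statistics of the dissertation; the data is synthetic.

# Permutation test of one predictor's contribution, with R^2 as a stand-in
# statistic: small p-value means the predictor matters.
import numpy as np
from numpy.linalg import lstsq

def r2(X, y):
    Z = np.column_stack([np.ones(len(X)), X])
    beta = lstsq(Z, y, rcond=None)[0]
    return 1 - (y - Z @ beta).var() / y.var()

def permutation_pvalue(X, y, j, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    observed = r2(X, y)
    null = []
    for _ in range(n_perm):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break X_j's link with y
        null.append(r2(Xp, y))
    return np.mean(np.array(null) >= observed)

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 3))
y = 2 * X[:, 0] + rng.standard_normal(300)    # only X_0 matters
print(permutation_pvalue(X, y, j=0), permutation_pvalue(X, y, j=2))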
27

Blum, Joshua (Joshua M.). "Pinky : interactively analyzing large EEG datasets." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/105939.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 75-77).
In this thesis, I describe a system I designed and implemented for interactively analyzing large electroencephalogram (EEG) datasets. Trained experts, known as encephalographers, analyze EEG data to determine if a patient has experienced an epileptic seizure. Since EEG analysis is time intensive for large datasets, there is a growing corpus of unanalyzed EEG data. Fast analysis is essential for building a set of example data of EEG results, allowing doctors to quickly classify the behavior of future EEG scans. My system aims to reduce the cost of analysis by providing near real-time interaction with the datasets. The system has three optimized layers handling the storage, computation, and visualization of the data. I evaluate the design choices for each layer and compare three different implementations across different workloads.
by Joshua Blum.
M. Eng.
28

Hilton, Erwin. "Visual datasets for artificial intelligence agents." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/119553.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from PDF version of thesis.
Includes bibliographical references (page 41).
In this thesis, I designed and implemented two visual dataset generation tool frameworks. With these tools, I introduce procedurally generated data for testing VQA agents and other visual AI models. The first tool is the Spatial IQ Generative Dataset (SIQGD). This tool generates images based on the Raven's Progressive Matrices spatial IQ examination metric. The second tool is a collection of 3D models along with a Blender3D extension that renders images of the models from multiple viewpoints along with their depth maps.
by Erwin Hilton.
M. Eng.
29

Roizman, Violeta. "Flexible clustering algorithms for heterogeneous datasets." Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG002.

Full text
Abstract:
The goal of the clustering task is to find groups of elements that are homogeneous with respect to a chosen distance. Given its unsupervised nature, clustering can be applied to any kind of data, and there is no need for a costly labelling process. One of the most popular clustering algorithms is the one built on the Gaussian Mixture Model (GMM). This algorithm is very intuitive and works well when the clusters have an elliptical shape. Regardless of its popularity, the GMM performs poorly when the data points do not fulfil its basic assumption of Gaussian distributed clusters: the model can be strongly degraded by the non-robustness of the classical estimators involved in the model fitting when the data contains outliers or noise. In this thesis, we give an alternative approach to the robustification of the GMM-EM method. We adopt a model based on Elliptical Symmetric distributions that describes a more general range of distributions. Besides, we introduce extra parameters that increase the flexibility of our model and lead to generalizations of classical robust estimators. In order to support the robustness claims about our algorithm, we provide theoretical and practical analyses that help to understand the general character of the proposal. Afterwards, we tackle the outlier-rejection task. We consider a robust version of the Mahalanobis distance and study its distribution; knowing the distribution helps us set a rejection threshold for the classification of new data. Finally, we address two applications related to radar images through a clustering perspective. First, we consider the image segmentation task. In the end, we apply our flexible algorithm to solve the change detection problem for image time series.
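The outlier-rejection step described in the abstract can be illustrated with off-the-shelf robust estimators: compute robust Mahalanobis distances from a robust location/scatter fit and reject points beyond a chi-squared quantile. The thesis's flexible elliptical mixture model is not reproduced; scikit-learn's MinCovDet stands in for its robust estimators in this sketch, and the data is synthetic.

# Robust Mahalanobis outlier rejection with a chi-squared threshold.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((200, 2)),        # bulk of the data
               rng.standard_normal((10, 2)) + 6.0])  # a few outliers

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                   # squared robust distances
threshold = chi2.ppf(0.975, df=X.shape[1])
print(np.where(d2 > threshold)[0])        # indices flagged as outliers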
30

Katcoff, Abigail. "Aligning heterogenous single cell assay datasets." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/123030.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 51-53).
Pluripotent stem cells offer strong promise for regenerative medicine, but the pluripotent cell state is poorly understood. The goal of this thesis is the development of methods to analyze how the multiple facets of cell state (including gene expression, chromosome contacts, and chromatin accessibility) relate in the context of stem cells. The variability of each of these characteristics cannot be deduced from population studies, and while recent advances in single-cell transcriptomics have led to the development of a number of different single-cell assays, datasets that collect multiple types of assays on the same cells are rare. In this thesis, we explore the ability of three methods to integrate datasets from different single-cell assays based on an existing paired single-cell dataset of ATAC-seq and RNA-seq for human A549 cells. We then apply these methods to map the variability between three single-cell datasets (ATAC-seq, RNA-seq, and Hi-C) on pluripotent mouse embryonic stem cells and assess the performance of these methods.
by Abigail Katcoff.
M. Eng.
M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
31

Gerick, Steven Anthony. "Information Engineering with E-Learning Datasets." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-265008.

Full text
Abstract:
The rapid growth of the E-learning industry necessitates a streamlined process for identifying actionable information in the user databases maintained by E-learning companies. This paper applies several traditional mathematical and some machine learning techniques to one such dataset with the goal of identifying patterns in user proficiency that are not readily apparent from simply viewing the data. We also analyze the applicability of such methods to the dataset in question and datasets like it. We find that many of the methods can reveal useful insights into the dataset, even if some methods are limited by the database structure and even when the database has fundamental limits to the fraction of variance that can be explained. We also find that such methods are much more applicable when dataset records have clear times and student grades have fine resolution. We also suggest several changes to the way data is gathered and recorded in order to make mass-application of machine learning techniques feasible to more datasets.
32

Smith, Zach. "Joining and aggregating datasets using CouchDB." Master's thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/29530.

Full text
Abstract:
Data mining typically requires implementing operations that involve cross-cutting entity boundaries and are awkward to implement in document-oriented databases. CouchDB, for example, models entities as documents with highly isolated entity boundaries, on which joins cannot be directly performed. This project shows how joins and aggregation can be achieved across entity boundaries in such systems, as encountered for example in the pre-processing and exploration stages of educational data mining. A software stack is presented as a means by which this can be achieved: first, datasets are processed via ETL operations, then MapReduce is used to create indices of ordered and aggregated data. Finally, a CouchDB list function is used to iterate through these indices and perform joins, and to compute aggregated values on joined datasets such as variance and correlations. In terms of the case study, it is shown that the proposed approach to implementing cross-document joins and aggregation is effective and scalable. In addition, it was discovered that for the 2014-2016 UCT cohorts, NBT scores correlate better with final grades for the CSC1015F course than do Grade 12 results for English, Science and Mathematics.
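A minimal sketch of the stack the abstract describes, run against a hypothetical local CouchDB instance (the URL, credentials, database name, and document fields are made up): a map function builds an ordered view keyed on the join attribute, and a list function iterates the view rows to group rows sharing a key. The project's actual ETL and aggregation logic is not reproduced.

# Define a design document with a map view and a list function, then query it
# over CouchDB's HTTP API.
import requests

DB = "http://admin:secret@localhost:5984/grades"   # hypothetical instance

design = {
    "_id": "_design/joins",
    "views": {"by_student": {
        # Index every document under its join key; CouchDB keeps rows ordered.
        "map": "function (doc) { emit(doc.student_id, doc); }"
    }},
    "lists": {"join": (
        "function (head, req) {"
        "  var row, out = {};"
        "  while ((row = getRow())) {"
        "    (out[row.key] = out[row.key] || []).push(row.value);"
        "  }"
        "  send(JSON.stringify(out));"
        "}"
    )},
}

requests.put(f"{DB}/_design/joins", json=design)
# Rows sharing a student_id come back grouped, i.e., joined across documents.
joined = requests.get(f"{DB}/_design/joins/_list/join/by_student").json()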
33

Somasundaram, Jyothilakshmi. "Releasing Recommendation Datasets while Preserving Privacy." Miami University / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=miami1306427987.

Full text
34

Han, Qian. "Mining Shared Decision Trees between Datasets." Wright State University / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=wright1274807201.

Full text
35

Joshi, Vineet. "Unsupervised Anomaly Detection in Numerical Datasets." University of Cincinnati / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1427799744.

Full text
36

Liu, Fang. "Mining Security Risks from Massive Datasets." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/78684.

Full text
Abstract:
Cyber security risk has been a problem ever since the appearance of telecommunication and electronic computers. Over the past 30 years, researchers have developed various tools to protect the confidentiality, integrity, and availability of data and programs. However, new challenges are emerging as the amount of data grows rapidly in the big data era. On one hand, attacks are becoming stealthier by concealing their behaviors in massive datasets. On the other hand, it is becoming more and more difficult for existing tools to handle massive datasets with various data types. This thesis presents attempts to address these challenges and solve different security problems by mining security risks from massive datasets. The attempts are in three aspects: detecting security risks in the enterprise environment, prioritizing security risks of mobile apps, and measuring the impact of security risks between websites and mobile apps. First, the thesis presents a framework to detect data leakage in very large content. The framework can be deployed on the cloud for enterprises and preserves the privacy of sensitive data. Second, the thesis prioritizes the inter-app communication risks in large-scale Android apps by designing a new distributed inter-app communication linking algorithm and performing nearest-neighbor risk analysis. Third, the thesis measures the impact of the deep link hijacking risk, one type of inter-app communication risk, on 1 million websites and 160 thousand mobile apps. The measurement reveals the failure of Google's attempts to improve the security of deep links.
Ph. D.
37

Siddique, Nahian A. "PATTERN RECOGNITION IN CLASS IMBALANCED DATASETS." VCU Scholars Compass, 2016. http://scholarscompass.vcu.edu/etd/4480.

Full text
Abstract:
Class imbalanced datasets constitute a significant portion of the machine learning problems of interest, where recognizing the 'rare class' is the primary objective for most applications. Traditional linear machine learning algorithms are often not effective in recognizing the rare class. In this research work, a specifically optimized feed-forward artificial neural network (ANN) is proposed and developed to train from moderately to highly imbalanced datasets. The proposed methodology deals with the difficulty of the classification task in multiple stages: by optimizing the training dataset, modifying the kernel function to generate the gram matrix, and optimizing the NN structure. First, the training dataset is extracted from the available sample set through an iterative process of selective under-sampling. Then, the proposed artificial NN comprises a kernel function optimizer to specifically enhance class boundaries for imbalanced datasets by conformally transforming the kernel functions. Finally, a single hidden layer weighted neural network structure is proposed to train models from the imbalanced dataset. The proposed NN architecture is derived to effectively classify any binary dataset with even a very high imbalance ratio, given appropriate parameter tuning and a sufficient number of processing elements. The effectiveness of the proposed method is tested on accuracy-based performance metrics, achieving close to and above 90% with several imbalanced datasets of a generic nature, and compared with state-of-the-art methods. The proposed model is also used for classification of a 25 GB computed tomographic colonography database to test its applicability to big data. The effectiveness of under-sampling and kernel optimization for training the NN model from the modified kernel gram matrix representing the imbalanced data distribution is also analyzed experimentally. Computation time analysis shows the feasibility of the system for practical purposes. The report concludes with a discussion of the prospects of the developed model and suggestions for further development work in this direction.
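The first stage the abstract outlines, extracting a training set by under-sampling the majority class, can be sketched as follows. Note the thesis uses an iterative selective criterion, whereas this simplified sketch under-samples at random to a chosen ratio on synthetic data.

# Random majority-class under-sampling to a target majority:minority ratio.
import numpy as np

def undersample(X, y, majority=0, ratio=1.0, seed=0):
    rng = np.random.default_rng(seed)
    maj = np.where(y == majority)[0]
    mino = np.where(y != majority)[0]
    n_keep = min(len(maj), int(ratio * len(mino)))
    keep = np.concatenate([rng.choice(maj, n_keep, replace=False), mino])
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
y = (rng.random(1000) < 0.05).astype(int)        # ~5% rare class
Xb, yb = undersample(X, y, majority=0, ratio=1.0)
print(np.bincount(y), "->", np.bincount(yb))     # balanced training set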
38

Fraser, Ross Macdonald. "Computational analysis of nucleosome positioning datasets." Thesis, University of Edinburgh, 2006. http://hdl.handle.net/1842/29110.

Full text
Abstract:
Monomer extension (ME) is an established in vitro experimental technique which maps the positions adopted by reconstituted core histone octamers on a defined DNA sequence. It provides quantitative positioning information, at high resolution, over long continuous stretches of DNA sequence. This technique has been employed to map several genes: globin genes (8 kbp), the beta-lactoglobulin gene (10 kbp) and various imprinting genes (4 kbp). This study explores and analyses this unique dataset, utilising computational and stochastic techniques, to gain insight into the potential influence of nucleosomal positioning on the structure and function of chromatin. The first section of this thesis expands upon prior analyses, explores general features of the dataset using common bioinformatics tools, and attempts to relate the quantitative positioning information from ME to data from other commonly used competitive reconstitution protocols. Finally, evidence of a correlation between the in vitro ME dataset and in vivo nucleosome positions for the beta-lactoglobulin gene region is presented. The second section presents the development of a novel method for the analysis of ME maps using Monte Carlo simulation methods. The goal was to use the ME datasets to simulate a higher order chromatin fibre, taking advantage of the long-range and quantitative nature of the ME datasets. The Monte Carlo simulations have allowed new insights to be gleaned from the datasets. Analysis of the beta-lactoglobulin positioning map indicates the potential for discrete disruption of nucleosomal organisation, at specific physiological nucleosome densities, over regions found to have unusual chromatin structure in vivo. This suggests a correspondence between the quantitative histone octamer positioning information in vitro and the positioning of nucleosomes in vivo. Taken together, these studies lend weight to the hypothesis that nucleosome positioning information encoded within DNA plays a fundamental role in directing chromatin structure in vivo.
APA, Harvard, Vancouver, ISO, and other styles
39

Tao, F. "Data mining for relationships in large datasets." Thesis, Queen's University Belfast, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.273298.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Talár, Ondřej. "Redukce šumu audionahrávek pomocí hlubokých neuronových sítí." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2017. http://www.nusl.cz/ntk/nusl-317118.

Full text
Abstract:
The thesis focuses on the use of a deep recurrent neural network with the Long Short-Term Memory architecture for robust denoising of audio signals. LSTM is currently very attractive due to its ability to remember previous states and to update them not only according to the training algorithm but also by examining changes in neighbouring cells. The work describes the selection of the initial dataset and of the noise used, along with the creation of optimal test data. The Keras framework for Python is selected for building the training network, and possible candidates for viable solutions are explored and discussed.
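A minimal sketch of such a Keras-based LSTM denoiser is shown below, operating on spectrogram-like frames with noisy input and clean targets. It assumes TensorFlow's Keras is available; the layer sizes and toy data are illustrative, not the configuration used in the thesis.

```python
import numpy as np
from tensorflow import keras

N_FRAMES, N_BINS = 100, 129   # time steps per sequence, frequency bins (assumed)

# noisy spectrogram frames in, clean frames out
model = keras.Sequential([
    keras.layers.Input(shape=(N_FRAMES, N_BINS)),
    keras.layers.LSTM(256, return_sequences=True),               # temporal context
    keras.layers.TimeDistributed(keras.layers.Dense(N_BINS)),    # per-frame estimate
])
model.compile(optimizer="adam", loss="mse")

# toy training data: smooth "clean" trajectories plus additive noise
rng = np.random.default_rng(0)
clean = rng.standard_normal((32, N_FRAMES, N_BINS)).cumsum(axis=1) * 0.01
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
model.fit(noisy, clean, epochs=2, batch_size=8, verbose=0)

denoised = model.predict(noisy[:1], verbose=0)   # shape (1, N_FRAMES, N_BINS)
```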
APA, Harvard, Vancouver, ISO, and other styles
41

Romuld, Daniel, and Markus Ruhmén. "Compiling attention datasets : Developing a method for annotating face datasets with human performance attention labels using crowdsourcing." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-166708.

Full text
Abstract:
This essay expands on the problem of human attention detection in computer vision by providing a method for annotating existing face datasets with attention labels through the use of human intelligence. The work described in this essay is justified by a lack of human-performance attention datasets and by the potential uses of the developed method. Several images of crowds were generated using the Labeled Faces in the Wild dataset of images depicting faces, enabling evaluation of the attention level of the depicted subjects as members of a crowd. The data collection methodology was carefully designed to maximise the reliability and usability of the resulting dataset. The crowd images were evaluated by workers on the crowdsourcing platform CrowdFlower, which yielded human-performance attention labels. Analysis of the results showed that the submissions from workers on the crowdsourcing platform displayed a high level of consistency and reliability. Hence, the developed method, although not fully optimised, was deemed to be a valid process for creating a representation of human attention in a dataset.
This essay addresses the problem of detecting human attention, a problem within computer vision. As a step towards solving it, a method was developed for creating attention labels for datasets of face images. The labels constitute a measure of the perceived attention level of the people in the images. The work in this essay is motivated by the lack of datasets with attention labels and the potential usefulness of the developed method. The method was designed with a focus on maximising the reliability and usability of the collected data and the resulting dataset. As a first step in the method development, images of crowds were generated using the Labeled Faces in the Wild dataset. This made it possible to evaluate the attention level of the people in the images as individuals in a crowd. This property was assessed by workers on the crowdsourcing platform CrowdFlower. The answers were analysed and combined to compute a human-performance attention measure for each individual in the images. The analysis of the results showed that the answers from the workers on CrowdFlower were reliable, with high internal consistency. The developed method was deemed a valid approach for creating attention labels. Possible improvements were identified in several parts of the method and are reported as part of the essay's main results.
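The following sketch illustrates one way such crowdsourced judgments could be combined into per-face attention labels, weighting each worker by agreement with the unweighted majority. The reliability heuristic and all names are illustrative assumptions, not the aggregation used in the thesis.

```python
from collections import defaultdict

def attention_scores(judgments):
    """Combine crowdsourced judgments into per-face attention labels.
    `judgments` holds (worker_id, face_id, attending) tuples, attending
    in {0, 1}. Workers are weighted by agreement with the unweighted
    majority - an illustrative reliability heuristic."""
    by_face = defaultdict(list)
    for worker, face, att in judgments:
        by_face[face].append((worker, att))
    # first pass: unweighted majority label per face
    majority = {f: round(sum(a for _, a in v) / len(v)) for f, v in by_face.items()}
    # worker reliability = rate of agreement with the majority
    agree, total = defaultdict(int), defaultdict(int)
    for worker, face, att in judgments:
        total[worker] += 1
        agree[worker] += int(att == majority[face])
    weight = {w: agree[w] / total[w] for w in total}
    # second pass: reliability-weighted mean attention score in [0, 1]
    return {f: sum(weight[w] * a for w, a in v) / sum(weight[w] for w, _ in v)
            for f, v in by_face.items()}

data = [("w1", "f1", 1), ("w2", "f1", 1), ("w3", "f1", 0),
        ("w1", "f2", 0), ("w2", "f2", 0), ("w3", "f2", 1)]
print(attention_scores(data))   # {'f1': 1.0, 'f2': 0.0}
```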
APA, Harvard, Vancouver, ISO, and other styles
42

Liu, Qing Computer Science & Engineering Faculty of Engineering UNSW. "Summarization of very large spatial dataset." Awarded by: University of New South Wales. School of Computer Science and Engineering, 2006. http://handle.unsw.edu.au/1959.4/25489.

Full text
Abstract:
Nowadays there are a large number of applications, such as digital library information retrieval, business data analysis, CAD/CAM, multimedia applications with images and sound, real-time process control and scientific computation, with data sets of gigabytes, terabytes or even petabytes. Because data distributions are too large to be stored accurately, maintaining compact and accurate summarized information about the underlying data is of crucial importance. The summarization problem for Level 1 (disjoint and non-disjoint) topological relationships has been well studied for the past few years. However, spatial database users are often interested in a much richer set of spatial relations, such as contains. Little work has been done on summarization for Level 2 topological relationships, which include the contains, contained, overlap, equal and disjoint relations. We study the problem of effective summarization to represent the underlying data distribution when answering window queries for Level 2 topological relationships. The cell-density based approach has been demonstrated to be an effective way to address this problem, but the challenges are the accuracy of the results and the storage space required, which should be linearly proportional to the number of cells to be practical. In this thesis, we present several novel techniques to effectively construct cell-density based spatial histograms. Based on the proposed framework, exact results can be obtained in constant time for aligned window queries. To minimize the storage space of the framework, an approximate algorithm with approximation ratio 19/12 is presented, while the problem is shown to be NP-hard in general. Because the framework requires only storage space linearly proportional to the number of cells, it is practical for many popular real datasets. To conform to a limited storage space, effective histogram construction and query algorithms are proposed which can provide approximate results with high accuracy. The problem of non-aligned window queries is also investigated, and techniques based on unevenly partitioned space are developed to support them. Finally, we extend our techniques to 3D space. Our extensive experiments on both synthetic and real-world datasets demonstrate the efficiency of the algorithms developed in this thesis.
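The constant-time answering of aligned window queries from per-cell counts can be illustrated with a standard 2D prefix-sum table, as in the sketch below. This shows only the counting idea under simplified assumptions; the thesis framework additionally distinguishes the Level 2 topological relations.

```python
import numpy as np

class CellDensityHistogram:
    """Grid of per-cell counts with a 2D prefix-sum table, so any
    cell-aligned window query is answered in constant time. A sketch of
    the general idea, not the thesis framework itself."""
    def __init__(self, counts):
        c = np.asarray(counts, dtype=np.int64)
        self.pref = np.zeros((c.shape[0] + 1, c.shape[1] + 1), dtype=np.int64)
        self.pref[1:, 1:] = c.cumsum(axis=0).cumsum(axis=1)

    def query(self, r0, c0, r1, c1):
        """Total count in cells [r0..r1] x [c0..c1], inclusive, in O(1)."""
        p = self.pref
        return p[r1 + 1, c1 + 1] - p[r0, c1 + 1] - p[r1 + 1, c0] + p[r0, c0]

# usage: a 4x4 grid of object counts
h = CellDensityHistogram([[1, 0, 2, 1], [0, 3, 1, 0], [2, 1, 0, 1], [0, 0, 1, 2]])
assert h.query(0, 0, 1, 1) == 4   # top-left 2x2 window
```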
APA, Harvard, Vancouver, ISO, and other styles
43

Lamichhane, Niraj. "Prediction of Travel Time and Development of Flood Inundation Maps for Flood Warning System Including Ice Jam Scenario. A Case Study of the Grand River, Ohio." Youngstown State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ysu1463789508.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Akuney, Arseniy. "Information flow identification in large email datasets." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/39847.

Full text
Abstract:
Identifying information flow in emails is an important, yet challenging task. In this work we investigate several algorithms for identifying similar sentences in large email datasets, as well as an algorithm for reconstructing threads from unstructured emails. We present a detailed evaluation of each algorithm in terms of accuracy and time performance. We also investigate the usage of cloud computing in order to increase computational efficiency and make information discovery usable in real time.
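As an illustration of sentence-similarity detection at scale, the sketch below estimates Jaccard similarity between sentences via word shingles and MinHash signatures. This is one standard technique of the kind surveyed; the names and parameters are assumptions, not the thesis code.

```python
import hashlib

def shingles(sentence, k=3):
    """Lower-cased word k-shingles of a sentence."""
    words = sentence.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(sh, n_hashes=64):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
            for seed in range(n_hashes)]

def similarity(a, b):
    """Estimated Jaccard similarity of two sentences' shingle sets."""
    sa, sb = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(similarity("please review the attached quarterly report",
                 "please review the attached quarterly report by friday"))
```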
APA, Harvard, Vancouver, ISO, and other styles
45

Kolloju, Naresh Kumar. "Flexible and efficient exploration of rated datasets." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/44028.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Östergaard, Johan. "Planet Rendering Using Online High-Resolution Datasets." Thesis, Linköpings universitet, Institutionen för teknik och naturvetenskap, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-95360.

Full text
Abstract:
A large number of image datasets are publicly available online through the Web Map Service standard. The contents of the datasets span from satellite imagery of the Earth and other planets of our solar system to various scientific data. This thesis presents an implementation of a planet renderer that streams high-resolution image and height data from online datasets. The planet renderer uses a level-of-detail technique based on nodes connected in a quadtree data structure. The level of detail is determined by the size of the nodes in screen coordinates, which ensures a sufficient texel-to-pixel ratio of the image data and a reasonably consistent polygon size in screen space. The project is implemented in Uniview, an application developed by SCISS AB that is primarily used in planetarium domes to visualize the Universe. Being able to visualize high-resolution image data and scientific measurements in a dome environment provides viewers with a greater perspective on the data.
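The quadtree level-of-detail decision can be sketched as follows: a node is split while its projected screen size exceeds a threshold, keeping the texel-to-pixel ratio roughly constant. The toy projection and all names are illustrative assumptions, not the Uniview implementation.

```python
from dataclasses import dataclass

@dataclass
class Node:
    x: float
    y: float
    size: float   # extent in world units
    level: int

    def screen_size(self, camera_dist, viewport_px=1024):
        """Crude projected size in pixels; a stand-in for a real projection."""
        return self.size / max(camera_dist, 1e-6) * viewport_px

    def children(self):
        h = self.size / 2
        return [Node(self.x + dx * h, self.y + dy * h, h, self.level + 1)
                for dx in (0, 1) for dy in (0, 1)]

def select_lod(node, camera_dist, max_level=18, threshold_px=256, out=None):
    """Collect the quadtree nodes to render: split while a node's projected
    size exceeds the threshold, giving a roughly constant texel-to-pixel
    ratio. Each selected node's WMS tile would then be streamed and drawn."""
    out = [] if out is None else out
    if node.level < max_level and node.screen_size(camera_dist) > threshold_px:
        for c in node.children():
            select_lod(c, camera_dist, max_level, threshold_px, out)
    else:
        out.append(node)
    return out

tiles = select_lod(Node(0.0, 0.0, 1.0, 0), camera_dist=0.5)
print(len(tiles), "tiles at level", tiles[0].level)   # 64 tiles at level 3
```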
APA, Harvard, Vancouver, ISO, and other styles
47

Dusi, Venkata Satya Sridhar. "Automated Detection of Features in CFD Datasets." MSSTATE, 2001. http://sun.library.msstate.edu/ETD-db/theses/available/etd-11082001-152601/.

Full text
Abstract:
Typically, computational fluid dynamics (CFD) solutions produce large amounts of data that can be used for analysis. The enormous amount of data poses new challenges for effective exploration. The prototype system EVITA, based on ranked access to application-specific regions of interest, provides an effective tool for this purpose. Automated feature detection techniques are needed to identify the features in the dataset. Automated techniques for detecting shocks, expansion regions, vortices, separation lines, and attachment lines have already been developed. A new approach for identifying regions of flow separation is proposed. This technique assumes that each pair of separation and attachment lines has a vortex core associated with it, and it is based on the velocity field in the plane perpendicular to the vortex core. The present work describes these methods along with the results obtained.
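One standard automated vortex-detection criterion of the kind discussed, shown here as an illustrative sketch rather than the EVITA implementation, flags cells where the velocity-gradient tensor has complex eigenvalues (positive swirl strength):

```python
import numpy as np

def swirl_strength(u, v, dx=1.0, dy=1.0):
    """Imaginary part of the complex eigenvalues of the 2D velocity-gradient
    tensor; positive values flag locally swirling (vortical) flow."""
    du_dy, du_dx = np.gradient(u, dy, dx)   # np.gradient returns d/daxis0, d/daxis1
    dv_dy, dv_dx = np.gradient(v, dy, dx)
    trace = du_dx + dv_dy
    det = du_dx * dv_dy - du_dy * dv_dx
    disc = (trace / 2.0) ** 2 - det          # negative => complex eigenvalues
    return np.sqrt(np.maximum(-disc, 0.0))

# usage: a solid-body vortex u = -y, v = x has uniform unit swirl strength
ys, xs = np.mgrid[-1:1:64j, -1:1:64j]
h = 2.0 / 63.0                               # grid spacing
print(swirl_strength(-ys, xs, dx=h, dy=h).mean())   # ~1.0
```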
APA, Harvard, Vancouver, ISO, and other styles
48

Goldstein, Markus [Verfasser]. "Anomaly Detection in Large Datasets / Markus Goldstein." München : Verlag Dr. Hut, 2014. http://d-nb.info/1052374948/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Osman, Ahmad. "Automated evaluation of three dimensional ultrasonic datasets." PhD thesis, INSA de Lyon, 2013. http://tel.archives-ouvertes.fr/tel-00995119.

Full text
Abstract:
Non-destructive testing has become necessary to ensure the quality of materials and components, either in service or at the production stage. This requires the use of a rapid, robust and reliable testing technique. As a main testing technique, ultrasound technology has unique abilities to assess discontinuity location, size and shape. Such information plays a vital role in acceptance criteria that are based on the safety and quality requirements of manufactured components. Consequently, the ultrasound technique is used extensively, especially in the inspection of large-scale composites manufactured in the aerospace industry. Significant technical advances have contributed to optimizing ultrasound acquisition techniques such as the sampling phased array technique. However, acquisition systems need to be complemented with an automated data analysis procedure to avoid the time-consuming manual interpretation of all produced data. Such a complement would accelerate the inspection process and improve its reliability. The objective of this thesis is to propose an analysis chain dedicated to automatically processing the 3D ultrasound volumes obtained using the sampling phased array technique. First, a detailed study of the speckle noise affecting the ultrasound data was conducted, as speckle reduces the quality of ultrasound data. Afterwards, an analysis chain was developed, composed of a segmentation procedure followed by a classification procedure. The proposed segmentation methodology is adapted to 3D ultrasound data and aims to detect all potential defects inside the input volume. While the detection of defects is vital, one main difficulty is the high number of false alarms raised by the segmentation procedure. Correctly identifying false alarms is necessary to reduce the rejection ratio of safe parts, and this has to be done without risking missing true defects. Therefore, there is a need for a powerful classifier which can efficiently distinguish true defects from false alarms. This is achieved using a specific classification approach based on data fusion theory. The chain was tested on several volumetric ultrasound measurements of Carbon Fiber Reinforced Polymer components. Experimental results revealed high accuracy and reliability in detecting, characterizing and classifying defects.
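A drastically simplified sketch of such a two-stage chain is shown below: median filtering for speckle suppression, thresholding and connected-component labelling for candidate detection, and a crude size/contrast rule standing in for the data-fusion classifier. It assumes NumPy and SciPy; all parameters and names are illustrative.

```python
import numpy as np
from scipy import ndimage

def detect_defects(volume, threshold=0.6, min_voxels=20):
    """Two-stage sketch: speckle suppression + thresholding to segment
    candidate defects, then a crude size/contrast rule standing in for
    the data-fusion classifier that rejects false alarms."""
    smoothed = ndimage.median_filter(volume, size=3)        # speckle suppression
    labels, n = ndimage.label(smoothed > threshold)         # candidate regions
    defects = []
    for lbl in range(1, n + 1):
        mask = labels == lbl
        size = int(mask.sum())
        contrast = float(smoothed[mask].mean() - smoothed.mean())
        if size >= min_voxels and contrast > 0.1:           # stand-in classifier
            defects.append({"label": lbl, "voxels": size, "contrast": round(contrast, 3)})
    return defects

# usage on a synthetic 64^3 volume with one bright inclusion
rng = np.random.default_rng(1)
vol = rng.random((64, 64, 64)) * 0.5
vol[20:26, 20:26, 20:26] = 0.9
print(detect_defects(vol))   # one surviving candidate
```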
APA, Harvard, Vancouver, ISO, and other styles
50

Obalappa, Dinesh Tretiak Oleh J. "Optimal caching of large multi-dimensional datasets /." Philadelphia, Pa. : Drexel University, 2004. http://dspace.library.drexel.edu/handle/1860/307.

Full text
APA, Harvard, Vancouver, ISO, and other styles