Dissertations / Theses on the topic 'Synthetic datasets'




Consult the top 23 dissertations / theses for your research on the topic 'Synthetic datasets.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

D'Agostino, Alessandro. "Automatic generation of synthetic datasets for digital pathology image analysis." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/21722/.

Full text
Abstract:
The project is inspired by a real problem of timing and accessibility in the analysis of histological samples in the health-care system. In this project, I address the problem of synthetic histological image generation for the purpose of training neural networks for the segmentation of real histological images. The collection of real, human-labeled histological samples is a very time-consuming and expensive process and, owing to the intrinsic nature of medical analysis, is often not representative of healthy samples. The method I propose is based on replicating the traditional specimen-preparation technique in a virtual environment. The first step is the creation of a 3D virtual model of a region of the target human tissue. The model should represent all the key features of the tissue: the richer it is, the better the yielded result. The second step is to sample the model through a virtual tomography process, which produces a first, completely labeled image of the section. This image is then processed with different tools to achieve a histological-like aspect. The most significant aesthetic post-processing is the action of a style-transfer neural network, which transfers the typical histological visual texture onto the synthetic image. This procedure is presented in detail for two specific models: one of pancreatic tissue and one of dermal tissue. The two resulting images compose a pair suitable for supervised learning. The generation process is completely automated and does not require the intervention of any human operator, hence it can be used to produce arbitrarily large datasets. The synthetic images are inevitably less complex than the real samples and offer an easier segmentation task for the neural network to solve. However, the synthetic images are very abundant, and the training of a network can take advantage of this feature by following the so-called curriculum learning strategy.
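The decisive property of this pipeline is that the label map exists before the image is rendered, so annotation comes for free by construction. The toy sketch below illustrates that label-first idea in NumPy/SciPy; the blob "tissue", the two-color palette, and all function names are illustrative stand-ins, not the thesis's actual 3D-model and style-transfer pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def toy_histology_pair(size=256, seed=0):
    """Label-first synthesis: build the ground-truth mask, then render an
    image from it, so the (image, labels) pair needs no human annotation."""
    rng = np.random.default_rng(seed)
    # Stand-in for the virtual tomography step: a smooth random field
    # thresholded into tissue-like blobs.
    field = gaussian_filter(rng.normal(size=(size, size)), sigma=8)
    labels = (field > 0).astype(np.uint8)        # 0 = stroma, 1 = nuclei-like
    # Stand-in for staining and style transfer: H&E-like colors plus noise.
    palette = np.array([[230.0, 180.0, 200.0], [90.0, 40.0, 120.0]])
    image = palette[labels] + rng.normal(scale=10.0, size=(size, size, 3))
    return np.clip(image, 0, 255).astype(np.uint8), labels

image, labels = toy_histology_pair()   # one supervised training pair
```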
APA, Harvard, Vancouver, ISO, and other styles
2

Hummel, Georg [Verfasser], Peter [Akademischer Betreuer] [Gutachter] Stütz, and Paolo [Gutachter] Remagnino. "On synthetic datasets for development of computer vision algorithms in airborne reconnaissance applications / Georg Hummel ; Gutachter: Peter Stütz, Paolo Remagnino ; Akademischer Betreuer: Peter Stütz ; Universität der Bundeswehr München, Fakultät für Luft- und Raumfahrttechnik." Neubiberg : Universitätsbibliothek der Universität der Bundeswehr München, 2017. http://d-nb.info/1147386331/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Zhao, Amy (Xiaoyu Amy). "Learning distributions of transformations from small datasets for applied image synthesis." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/128342.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2020
Cataloged from PDF of thesis. "February 2020."
Includes bibliographical references (pages 75-91).
Much of the recent research in machine learning and computer vision focuses on applications with large labeled datasets. However, in realistic settings, it is much more common to work with limited data. In this thesis, we investigate two applications of image synthesis using small datasets. First, we demonstrate how to use image synthesis to perform data augmentation, enabling the use of supervised learning methods with limited labeled data. Data augmentation -- typically the application of simple, hand-designed transformations such as rotation and scaling -- is often used to expand small datasets. We present a method for learning complex data augmentation transformations, producing examples that are more diverse, realistic, and useful for training supervised systems than hand-engineered augmentation. We demonstrate our proposed augmentation method for improving few-shot object classification performance, using a new dataset of collectible cards with fine-grained differences. We also apply our method to medical image segmentation, enabling the training of a supervised segmentation system using just a single labeled example. In our second application, we present a novel image synthesis task: synthesizing time lapse videos of the creation of digital and watercolor paintings. Using a recurrent model of paint strokes and a novel training scheme, we create videos that tell a plausible visual story of the painting process.
by Amy (Xiaoyu) Zhao.
Ph.D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
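The augmentation idea in this abstract, learning spatial and appearance transforms and applying them to the one labeled example to mint new labeled pairs, can be sketched as follows. In this hedged toy version both transforms are random smooth fields; in the thesis they are learned from unlabeled scans, which is the actual contribution, and all names here are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def augment_pair(image, label, rng, sigma=8.0, alpha=15.0):
    """Warp image and label with one spatial transform; change intensities
    of the image only. Random fields stand in for learned transform models."""
    h, w = image.shape
    dx = gaussian_filter(rng.normal(size=(h, w)), sigma) * alpha
    dy = gaussian_filter(rng.normal(size=(h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = [ys + dy, xs + dx]
    warped_image = map_coordinates(image, coords, order=1)
    warped_label = map_coordinates(label, coords, order=0)  # keep class ids intact
    gain = 1.0 + 0.2 * gaussian_filter(rng.normal(size=(h, w)), sigma)
    return warped_image * gain, warped_label

rng = np.random.default_rng(0)
img = rng.random((128, 128))
lab = (rng.random((128, 128)) > 0.5).astype(np.uint8)
aug_img, aug_lab = augment_pair(img, lab, rng)  # a new synthetic labeled pair
```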
APA, Harvard, Vancouver, ISO, and other styles
4

He, Wenbin. "Exploration and Analysis of Ensemble Datasets with Statistical and Deep Learning Models." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1574695259847734.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Bartocci, John Timothy. "Generating a synthetic dataset for kidney transplantation using generative adversarial networks and categorical logit encoding." Bowling Green State University / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1617104572023027.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Choudhury, Ananya. "WiSDM: a platform for crowd-sourced data acquisition, analytics, and synthetic data generation." Thesis, Virginia Tech, 2016. http://hdl.handle.net/10919/72256.

Full text
Abstract:
Human behavior is a key factor influencing the spread of infectious diseases. Individuals adapt their daily routine and typical behavior during the course of an epidemic; the adaptation is based on their perception of the risk of contracting the disease and of its impact. As a result, it is desirable to collect behavioral data before and during a disease outbreak. Such data can help in creating better computer models that can, in turn, be used by epidemiologists and policy makers to better plan for and respond to infectious disease outbreaks. However, traditional data collection methods are not well suited to the task of acquiring information related to human behavior, especially as it pertains to epidemic planning and response. Internet-based methods are an attractive complementary mechanism for collecting behavioral information. Systems such as Amazon Mechanical Turk (MTurk) and online survey tools provide simple ways to collect such information. This thesis explores new methods for information acquisition, especially of behavioral information, that leverage this recent technology. Here, we present the design and implementation of a crowd-sourced surveillance data acquisition system, WiSDM. WiSDM is a web-based application and can be used by anyone with access to the Internet and a browser. Furthermore, it is designed to leverage online survey tools and MTurk; WiSDM can be embedded within MTurk in an iFrame. WiSDM has a number of novel features, including (i) the ability to support a model-based abductive reasoning loop: a flexible and adaptive information acquisition scheme driven by causal models of epidemic processes; (ii) question routing: an important feature to increase data acquisition efficacy and reduce survey fatigue; and (iii) integrated surveys: interactive surveys that provide additional information on the survey topic and improve user motivation. We evaluate the framework's performance using Apache JMeter and present our results. We also discuss three extensions of WiSDM: the API Adapter, the Synthetic Data Generator, and WiSDM Analytics. The API Adapter is an ETL extension of WiSDM that enables extracting data from disparate data sources and loading them into the WiSDM database. The Synthetic Data Generator allows epidemiologists to build synthetic survey data using NDSSL's Synthetic Population as agents. WiSDM Analytics empowers users to perform analysis on the data by writing simple Python code using Versa APIs. We also propose a data model that is conducive to survey data analysis.
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
7

Šlosár, Peter. "Generátor syntetické datové sady pro dopravní analýzu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236021.

Full text
Abstract:
This Master's thesis deals with the design and development of tools for generating a synthetic dataset for traffic analysis purposes. The first part contains a brief introduction to vehicle detection and rendering methods. Blender and a set of scripts are used to create a highly customizable dataset of training images and synthetic videos from a single photograph. Great care is taken to create very realistic output that is suitable for further processing in the field of traffic analysis. The produced images and videos are automatically and richly annotated. The results are tested by training a sample car detector and evaluating it on real-world test data; in this comparison of detection rates, the synthetic dataset outperforms real training datasets. The computational demands of the tools are evaluated as well. The final part sums up the contribution of this thesis and outlines extensions of the tools for future work.
APA, Harvard, Vancouver, ISO, and other styles
8

Oškera, Jan. "Detekce dopravních značek a semaforů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2020. http://www.nusl.cz/ntk/nusl-432850.

Full text
Abstract:
The thesis focuses on modern methods for detecting traffic signs and traffic lights, both directly in traffic and in retrospective analysis. The main subject is convolutional neural networks (CNNs); the solution uses convolutional neural networks of the YOLO type. The main goal of this thesis is to optimize the speed and accuracy of the models as far as possible. Suitable datasets are examined: a number of datasets, composed of both real and synthetic data, are used for training and testing, and the data were preprocessed using the Yolo mark tool. Training of the model was carried out at a computer center belonging to the virtual organization MetaCentrum VO. To quantify detector quality, a program was created that statistically and graphically shows the detector's success using ROC curves and the COCO evaluation protocol. In this thesis I created a model that achieved an average success rate of up to 81%. The thesis identifies the best choice of threshold across model versions, input sizes, and IoU values. An extension for mobile phones using TensorFlow Lite and Flutter has also been created.
APA, Harvard, Vancouver, ISO, and other styles
9

Kola, Ramya Sree. "Generation of synthetic plant images using deep learning architecture." Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-18450.

Full text
Abstract:
Background: Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are the current state-of-the-art machine-learning systems for data generation. The initial architecture proposal consists of two neural networks, a generator and a discriminator, which compete in a zero-sum game to generate data whose realistic properties are inseparable from those of the original datasets. GANs have interesting applications in various domains, such as image synthesis, 3D object generation in the gaming industry, fake music generation (Dong et al.), text-to-image synthesis, and many more. Despite this wide range of application domains, GANs are most popular for image data synthesis, and various architectures have been developed for image synthesis, evolving from fuzzy images of digits to photorealistic images. Objectives: In this research work, we survey the literature on different GAN architectures to understand the significant works done to improve them. The primary objective of this research work is the synthesis of plant images using the StyleGAN (Karras, Laine and Aila, 2018) variant of GAN, which is based on style transfer. The research also focuses on identifying machine-learning performance evaluation metrics that can be used to measure the StyleGAN model on the generated image datasets. Methods: A mixed-method approach is used in this research. We review the literature on GANs and elaborate in detail how each GAN network is designed and how it evolved over the base architecture. We then study the StyleGAN (Karras, Laine and Aila, 2018a) design details, as well as related work on evaluating GAN model performance and measuring the quality of generated image datasets. We conduct an experiment implementing the style-based GAN on a leaf dataset (Kumar et al., 2012) to generate leaf images that are similar to the ground truth, describing in detail the steps of the experiment: data collection, preprocessing, training, and configuration. We also evaluate the performance of the StyleGAN training model on the leaf dataset. Results: We present the results of the literature review and of the conducted experiment to address the research questions. We review various GAN architectures and their key contributions, along with numerous qualitative and quantitative evaluation metrics for measuring the performance of a GAN architecture. We then present the synthetic data samples generated by the style-based GAN learning model at various training GPU-hours, including the latest synthetic samples after training for roughly 8 GPU-days on the Leafsnap dataset (Kumar et al., 2012). For most of the tested samples, the results have decent enough quality to expand the dataset. We visualize the model performance with TensorBoard graphs and an overall computational graph of the learning model, and we calculate the Fréchet Inception Distance (FID) score for our leaf StyleGAN, observed to be 26.4268 (lower is better). Conclusion: We conclude the research work with an overall review of the sections of the paper. The generated fake samples are very similar to the input ground truth and appear convincingly realistic to human visual judgement. However, the FID score measuring the performance of the leaf StyleGAN is large compared to that of StyleGAN's original celebrity HD-faces dataset, and we attempt to analyze the reasons for this large score.
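The FID score quoted in the results compares Gaussian fits of Inception activations for real and generated images. A minimal sketch of the standard definition follows (not code from the thesis); the random matrices only stand in for activations extracted beforehand from an Inception network.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Fréchet Inception Distance between two activation matrices
    (rows = samples, columns = Inception features). Standard definition:
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2)); lower is better."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real  # drop numerical imaginary dust
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))

# feats_* would come from an Inception network; random stand-ins here.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(256, 64)), rng.normal(0.1, 1.1, size=(256, 64))))
```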
APA, Harvard, Vancouver, ISO, and other styles
10

Baraheem, Samah Saeed. "Text to Image Synthesis via Mask Anchor Points and Aesthetic Assessment." University of Dayton / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=dayton158800567702413.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Arcidiacono, Claudio Salvatore. "An empirical study on synthetic image generation techniques for object detectors." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-235502.

Full text
Abstract:
Convolutional Neural Networks are a very powerful machine learning tool that has outperformed other techniques in image recognition tasks. The biggest drawback of this method is the massive amount of training data required, since producing training data for image recognition tasks is very labor intensive. To tackle this issue, different techniques have been proposed to generate synthetic training data automatically. These synthetic data generation techniques can be grouped in two categories: the first category generates synthetic images using computer graphics software and CAD models of the objects to recognize; the second category generates synthetic images by cutting the object from an image and pasting it onto another image. Since both techniques have their pros and cons, it is of interest to industry to investigate the two approaches in more depth. A common use case in industrial scenarios is detecting and classifying objects inside an image. Different objects belonging to classes relevant in industrial scenarios are often indistinguishable (for example, they are all the same component). For these reasons, this thesis work aims to answer the research question: "Among the CAD model generation technique, the cut-paste generation technique, and a combination of the two, which technique is more suitable for generating images for training object detectors in industrial scenarios?" In order to answer the research question, two synthetic image generation techniques, one from each category, are proposed. The proposed techniques are tailored to applications where all the objects belonging to the same class are indistinguishable, but they can also be extended to other applications. The two synthetic image generation techniques are compared by measuring the performance of an object detector trained using synthetic images on a test dataset of real images. The performance of the two synthetic data generation techniques when used for data augmentation has also been measured. The empirical results show that the CAD model generation technique works significantly better than the cut-paste generation technique when synthetic images are the only source of training data (61% better), whereas the two generation techniques perform equally well as data augmentation techniques. Moreover, the empirical results show that the model trained using only synthetic images performs almost as well as the model trained using real images (7.4% worse), and that augmenting the dataset of real images with synthetic images improves the performance of the model (9.5% better).
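A minimal sketch of the second category, cut-paste generation, may help picture why its labels come for free: the pasted mask is the ground truth. Everything here, the binary mask input and the uniform placement policy, is an illustrative assumption rather than the thesis's implementation.

```python
import numpy as np

def cut_paste(obj_img, obj_mask, background, rng):
    """Paste the masked object pixels at a random location on a background;
    return the composite and its detection mask (the label, for free)."""
    H, W = background.shape[:2]
    h, w = obj_img.shape[:2]
    y = int(rng.integers(0, H - h + 1))
    x = int(rng.integers(0, W - w + 1))
    out = background.copy()
    label = np.zeros((H, W), dtype=np.uint8)
    out[y:y + h, x:x + w][obj_mask] = obj_img[obj_mask]  # object pixels only
    label[y:y + h, x:x + w][obj_mask] = 1                # ground-truth mask
    return out, label

rng = np.random.default_rng(0)
bg = np.zeros((64, 64, 3), dtype=np.uint8)
obj = np.full((16, 16, 3), 255, dtype=np.uint8)
composite, mask = cut_paste(obj, np.ones((16, 16), dtype=bool), bg, rng)
```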
APA, Harvard, Vancouver, ISO, and other styles
12

Pazderka, Radek. "Segmentace obrazových dat pomocí hlubokých neuronových sítí." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2019. http://www.nusl.cz/ntk/nusl-403816.

Full text
Abstract:
This master's thesis focuses on segmentation of scenes from the traffic environment. The problem is solved with segmentation neural networks, which enable the classification of every pixel in the image. In this thesis a segmentation neural network is created that reaches better results than current state-of-the-art architectures. The work also focuses on segmentation of the top view of the road, for which no freely available annotated datasets exist. For this purpose, an automatic tool was created that generates synthetic datasets using the PC game Grand Theft Auto V. The work compares networks trained solely on synthetic data with networks trained on both real and synthetic data. Experiments show that the synthetic data can be used for segmentation of data from the real environment. A system has been implemented that enables working with segmentation neural networks.
APA, Harvard, Vancouver, ISO, and other styles
13

Diffner, Fredrik, and Hovig Manjikian. "Training a Neural Network using Synthetically Generated Data." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-280334.

Full text
Abstract:
A major challenge in training machine learning models is the gathering and labeling of a sufficiently large training data set. A common solution is the use of a synthetically generated data set to expand or replace a real data set. This paper examines the performance of a machine learning model trained on a synthetic data set versus the same model trained on real data. This approach was applied to the problem of character recognition using a machine learning model that implements convolutional neural networks. A synthetic data set of 1,240,000 images and two real data sets, Char74k and ICDAR 2003, were used. The result was that the model trained on the synthetic data set achieved an accuracy about 50% better than that of the same model trained on the real data set.
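The appeal of the synthetic route is that the label is known by construction: the generator knows which character it drew. A minimal Pillow sketch under stated assumptions (only the bundled default font; a real generator would also randomize fonts, distortions, blur, and color):

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synth_char(ch, size=32, seed=None):
    """One synthetic character-recognition sample: a glyph at a jittered
    position over a random gray background, labeled for free."""
    rng = random.Random(seed)
    img = Image.new("L", (size, size), color=rng.randint(100, 255))
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    draw.text((rng.randint(4, 12), rng.randint(4, 12)), ch,
              fill=rng.randint(0, 80), font=font)
    return img, ch

samples = [synth_char(c, seed=i) for i, c in enumerate("ABC123")]
```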
APA, Harvard, Vancouver, ISO, and other styles
14

Klinkert, Rickard. "Uncertainty Analysis of Long Term Correction Methods for Annual Average Winds." Thesis, Umeå universitet, Institutionen för fysik, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-59690.

Full text
Abstract:
For the construction of a wind farm, one needs to assess the wind resources of the considered site location. Using reference time series from numerical weather prediction models, global assimilation databases, or observations close to the area considered, the on-site measured wind speeds and wind directions are corrected to represent the actual long-term wind conditions. This long-term correction (LTC) is typically performed using linear regression within the Measure-Correlate-Predict (MCP) method. This method and two others, Sector-Bin (SB) and Synthetic Time Series (ST), are used to determine the uncertainties associated with LTC. The test area chosen in this work is located in the North Sea region, using 22 quality-controlled meteorological (met) station observations from offshore or near-shore locations in Denmark, Norway and Sweden. The time series used cover the eight-year period from 2002 to 2009, and the year with the largest variability in wind speeds, 2007, is used as the short-term measurement period. The long-term reference datasets are the Weather Research and Forecast (WRF) model, based on both the ECMWF Interim Re-Analysis (ERA-Interim) and the National Centers for Environmental Prediction Final Analysis (NCEP/FNL), together with additional reference datasets from the Modern Era Re-Analysis (MERRA) and QuikSCAT satellite observations. For all reference datasets except QuikSCAT, the long-term period corresponds to that of the station observations; the QuikSCAT observations cover the period from November 1st, 1999 until October 31st, 2009. The analysis is divided into three parts. Initially, the uncertainty connected to each reference dataset, when used in an LTC method, is investigated. Thereafter, the uncertainty due to the concurrent length of the on-site measurements and the reference dataset is analyzed. Finally, the uncertainty is approached using non-parametric bootstrap re-sampling. The uncertainty of the LTC method SB, for a fixed concurrent length of the datasets, is assessed by this methodology, in an effort to create a generic model for estimating the uncertainty in the values predicted by SB. The results show that LTC with WRF model datasets based on NCEP/FNL and ERA-Interim, respectively, differs slightly, but neither deviates considerably when compared with the met station observations. The results also support the use of the MERRA reference dataset in connection with long-term correction methods. However, the QuikSCAT datasets do not provide much information regarding the overall quality of long-term correction, and an approach other than using station coordinates for extracting QuikSCAT time series is preferred. Additionally, the LTC model Sector-Bin is found to be robust against variation in the correlation coefficient between the concurrent datasets. Regarding the dependence of uncertainty on concurrent time, the results show that an on-site measurement period of one consistent year or more gives lower uncertainties than shorter measurements, and that the standard deviation of the long-term corrected means decreases with concurrent time. Despite the efforts of using non-parametric bootstrap re-sampling, the estimation of the uncertainties is not fully determined; however, it does give promising results that are suggested for further work.
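For orientation, the regression step at the heart of MCP can be condensed to a few lines. This is the textbook linear-regression MCP with synthetic stand-in numbers; the SB and ST variants and the bootstrap uncertainty analysis studied in the thesis are not reproduced here.

```python
import numpy as np

def mcp_long_term_mean(ref_concurrent, site_concurrent, ref_long_term):
    """Fit site = a * reference + b on the concurrent period, then apply
    the fit to the long reference record to long-term correct the site mean."""
    a, b = np.polyfit(ref_concurrent, site_concurrent, deg=1)
    return a * np.mean(ref_long_term) + b

rng = np.random.default_rng(1)
ref_lt = 8.0 + 2.0 * rng.standard_normal(8 * 8760)   # 8 years of hourly reference wind
ref_cc = ref_lt[:8760]                               # one concurrent year
site_cc = 0.9 * ref_cc + 0.5 + 0.3 * rng.standard_normal(8760)  # on-site measurements
print(mcp_long_term_mean(ref_cc, site_cc, ref_lt))   # corrected annual mean speed
```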
APA, Harvard, Vancouver, ISO, and other styles
15

Silva, Bárbara Sofia Lopez de Carvalho Ferreira da. "Automatic Generation of Synthetic Website Wireframe Datasets from Source Code." Master's thesis, 2020. https://hdl.handle.net/10216/128542.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Drechsler, Jörg [Verfasser]. "Generating multiply imputed synthetic datasets : theory and implementation / vorgelegt von Jörg Drechsler." 2010. http://d-nb.info/1000445984/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Lobo, João Pedro Pereira. "G-Tric: enhancing triclustering evaluation using three-way synthetic datasets with ground truth." Master's thesis, 2020. http://hdl.handle.net/10451/48350.

Full text
Abstract:
Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2020
Three-dimensional datasets, or three-way data, have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed on real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (the triclustering solution) as output. G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the number of missing values, the noise, and the errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics to compare solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of the quality of new triclustering approaches. Besides reviewing the current state of the art regarding triclustering approaches, comparison studies and evaluation metrics, this work also analyzes how the lack of frameworks for generating synthetic data influences existing evaluation methodologies, limiting the scope of performance insights that can be extracted from each algorithm, and exemplifies how the decisions made in these evaluations can impact the quality and validity of the results. Alternatively, a different methodology that takes advantage of synthetic data with ground truth is presented. This approach, combined with a proposed extension of an existing clustering extrinsic measure, enables solutions' quality to be assessed from new perspectives.
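The generator's core operation, planting a ground-truth tricluster in a background tensor, can be pictured in a few lines of NumPy. This is a toy analogue of what G-Tric automates, with pattern types, overlap handling, and quality controls omitted; all parameters are illustrative.

```python
import numpy as np

def plant_tricluster(shape, rows, cols, ctxs, value=5.0, noise=0.1, seed=0):
    """Draw a background tensor (observations x features x contexts),
    overwrite one subspace with a constant pattern plus noise, and return
    the data together with the ground-truth index sets."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=shape)                 # background distribution
    idx = np.ix_(rows, cols, ctxs)
    data[idx] = value + noise * rng.standard_normal((len(rows), len(cols), len(ctxs)))
    return data, {"rows": rows, "cols": cols, "contexts": ctxs}

data, truth = plant_tricluster((100, 40, 10), rows=[2, 5, 7], cols=[1, 3], ctxs=[0, 4])
```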
APA, Harvard, Vancouver, ISO, and other styles
18

Su, Hua. "Large-scale snowpack estimation using ensemble data assimilation methodologies, satellite observations and synthetic datasets." 2009. http://hdl.handle.net/2152/7679.

Full text
Abstract:
This work focuses on a series of studies that contribute to the development and testing of advanced large-scale snow data assimilation methodologies. Compared to existing snow data assimilation methods and strategies, which are limited in domain size and landscape coverage, in the number of satellite sensors, and in the accuracy and reliability of the product, the present work covers a continental domain, compares single- and multi-sensor data assimilation, and explores uncertainties in parameters and model structure. In the first study a continental-scale snow water equivalent (SWE) data assimilation experiment is presented, which incorporates Moderate Resolution Imaging Spectroradiometer (MODIS) snow cover fraction (SCF) data into Community Land Model (CLM) estimates via the ensemble Kalman filter (EnKF). The greatest improvements of the EnKF approach are centered in the mountainous West, the northern Great Plains, and the west and east coast regions, with the magnitude of the corrections (compared to the use of the model only) greater than one standard deviation (calculated from the SWE climatology) in given areas. Relatively poor performance of the EnKF, however, is found in the boreal forest region. In the second study, snowpack-related parameter and model-structure errors were explicitly considered through a group of synthetic EnKF simulations which integrate synthetic datasets with model estimates. The inclusion of a new parameter estimation scheme augments the EnKF performance, for example increasing the Nash-Sutcliffe efficiency of season-long SWE estimates from 0.22 (without parameter estimation) to 0.96. In this study, the model-structure error is found to significantly impact the robustness of parameter estimation. In the third study, a multi-sensor snow data assimilation system over North America was developed and evaluated. It integrates both Gravity Recovery and Climate Experiment (GRACE) terrestrial water storage (TWS) and MODIS SCF information into CLM using the ensemble Kalman filter (EnKF) and smoother (EnKS). This GRACE/MODIS data assimilation run achieves significantly better performance than the MODIS-only run in the Saint Lawrence, Fraser, Mackenzie, Churchill & Nelson, and Yukon river basins. These improvements demonstrate the value of integrating complementary information for continental-scale snow estimation.
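All three studies revolve around the EnKF analysis step, which nudges each ensemble member toward the observations by a gain built from the ensemble's own sample covariance. The sketch below is a generic stochastic EnKF, not the dissertation's implementation; a real SWE system would add localization and work with far larger states.

```python
import numpy as np

def enkf_update(ensemble, obs, H, obs_var, rng):
    """Stochastic EnKF analysis: columns of `ensemble` are model states
    (e.g., gridded SWE); `obs` are observations (e.g., MODIS SCF) with
    variance obs_var; H maps state space to observation space."""
    n_state, n_ens = ensemble.shape
    X = ensemble - ensemble.mean(axis=1, keepdims=True)
    P = X @ X.T / (n_ens - 1)                      # sample state covariance
    S = H @ P @ H.T + obs_var * np.eye(len(obs))   # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    # Perturbed observations keep the analysis ensemble's spread honest.
    obs_pert = obs[:, None] + np.sqrt(obs_var) * rng.standard_normal((len(obs), n_ens))
    return ensemble + K @ (obs_pert - H @ ensemble)

rng = np.random.default_rng(0)
ens = rng.normal(10.0, 2.0, size=(50, 30))         # 50 grid cells, 30 members
H = np.eye(5, 50)                                  # observe the first 5 cells
analysis = enkf_update(ens, np.full(5, 12.0), H, obs_var=1.0, rng=rng)
```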
APA, Harvard, Vancouver, ISO, and other styles
19

Tsai, Meng-Fong, and 蔡孟峰. "Application and Study of imbalanced datasets base on Top-N Reverse k-Nearest Neighbor (TRkNN) coupled with Synthetic Minority Over-Sampling Technique (SMOTE)." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/38104987938865711006.

Full text
Abstract:
Ph.D. dissertation, National Chung Hsing University, Department of Computer Science and Engineering, ROC academic year 105 (2016/17)
Imbalanced classification means that the dataset has an unequal class distribution among its population. For a given dataset, if the imbalance is not considered, most classification methods predict with high accuracy for the majority class but significantly lower accuracy for the minority class. The first task in this dissertation is to provide an efficient algorithm, Top-N Reverse k-Nearest Neighbor (TRkNN), coupled with the Synthetic Minority Over-sampling Technique (SMOTE), to overcome this issue on several imbalanced datasets from the well-known UCI repository. To investigate the proposed algorithm, it was applied with different classification methods, such as logistic regression, C4.5, SVM, and BPNN. In addition, this research also adopted different distance metrics to classify the same UCI datasets. The empirical results illustrate that the Euclidean and Manhattan distances not only achieve higher accuracy, but also show greater computational efficiency than the Chebyshev and cosine distances. Therefore, the TRkNN- and SMOTE-based algorithm can be widely used to handle imbalanced datasets, and the question of how to choose suitable distance metrics can serve as a reference for future research. Research into cancer prediction has applied various machine learning algorithms, such as neural networks, genetic algorithms, and particle swarm optimization, to find the key to classifying illness or cancer properties, or to adapt traditional statistical prediction models to effectively differentiate between different types of cancers, and thus build prediction models that allow early detection and treatment. Training data from existing patients is used to establish models that predict the classification accuracy of new patient samples. This issue has attracted considerable attention in the field of data mining, and scholars have proposed various methods (e.g., random sampling and feature selection) to address category imbalances and achieve a re-balanced class distribution, thus improving the effectiveness of classifiers trained on limited data. Although resampling methods can quickly deal with the problem of unbalanced samples, they give more importance to the data in the majority class and neglect potentially important data in the minority class, thus limiting the effectiveness of classification. Based on patterns discovered in imbalanced medical data sets, the second task in this dissertation is to use the synthetic minority oversampling technique to mitigate imbalanced data set issues. In addition, this research also compares the resampling performance of various methods based on machine learning, soft computing, and bio-inspired computing, using three UCI medical data sets.
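The SMOTE half of the algorithm interpolates between minority samples, x_new = x + u * (x_nn - x) with u uniform on (0, 1). A compact NumPy sketch of plain SMOTE follows; the TRkNN selection step is not reproduced, and the Euclidean metric is used in line with the dissertation's findings.

```python
import numpy as np

def smote(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    random seed sample toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    d = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # a sample is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]              # k nearest minority neighbors
    seeds = rng.integers(0, n, size=n_new)
    picks = nn[seeds, rng.integers(0, k, size=n_new)]
    u = rng.random((n_new, 1))
    return X_minority[seeds] + u * (X_minority[picks] - X_minority[seeds])

X_min = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote(X_min, n_new=40)                     # 40 virtual minority samples
```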
APA, Harvard, Vancouver, ISO, and other styles
20

RUSSO, PAOLO. "Broadening deep learning horizons: models for RGB and depth images adaptation." Doctoral thesis, 2020. http://hdl.handle.net/11573/1365047.

Full text
Abstract:
Deep Learning has revolutionized the whole field of Computer Vision. Very deep models with a huge number of parameters have been successfully applied to big image datasets for difficult tasks like object classification, person re-identification, and semantic segmentation. The results have been two-fold: on one hand, astonishing performance, with accuracy often comparable to or better than a human counterpart; on the other, the development of robust, complex, and powerful visual features which exhibit the ability to generalize to new visual tasks. Still, the success of Deep Learning methods relies on the availability of big datasets: whenever the available labeled data are limited or redundant, a deep neural network model will typically overfit the training data, showing poor performance on new, unseen data. A typical solution used by the Deep Learning community in those cases is to rely on Transfer Learning techniques; among the several available methods, the most successful one has been to pre-train the deep model on a big heterogeneous dataset (like ImageNet) and then to fine-tune the model on the available training data. Among several fields of application, this approach has been heavily used by the robotics community for object recognition on depth images. Depth images are usually provided by depth sensors (e.g., Kinect), and their availability is somewhat scarce: the biggest depth image dataset publicly available includes 50,000 samples, making the use of a pre-trained network the only successful way to exploit deep models on depth data. Without any doubt, this method provides suboptimal results, as the network is trained on traditional RGB images, whose perceptual information is very different from that of depth maps; better results could be obtained if a big enough depth dataset were available, enabling the training of a deep model from scratch. Another frequent issue is the difference in statistical properties between training and test data (the domain gap). In this case, even in the presence of enough training data, the generalization ability of the model will be poor, motivating the use of a Domain Adaptation method able to reduce the gap between the domains; this can improve both the robustness of the model and its final classification performance. In this thesis both problems have been tackled by developing a series of Deep Learning solutions for Domain Adaptation and Transfer Learning tasks on RGB and depth image domains: a new synthetic depth image dataset is presented, showing the performance of a deep model trained from scratch on depth-only data. At the same time, a new powerful depth-to-RGB mapping module is analyzed, to optimize classification accuracy on depth image tasks while using deep models pre-trained on ImageNet. The study of the depth domain ends with a recurrent neural network for egocentric action recognition capable of exploiting depth images as an additional source of attention. A novel GAN model and a hybrid pixel/feature adaptation architecture for RGB images have been developed: the former for single-domain adaptation tasks, the latter for multi-domain adaptation and generalization tasks. Finally, a preliminary approach to the problem of multi-source Domain Adaptation on a semantic segmentation task is examined, based on the combination of a multi-branch segmentation model and an adversarial technique, capable of exploiting all the available synthetic training datasets and of increasing the overall performance. The performance obtained using the proposed algorithms is often better than or equivalent to the currently available state-of-the-art methods on several datasets and domains, demonstrating the superiority of our approach. Moreover, our analysis shows that creating ad-hoc domain adaptation and transfer learning techniques is mandatory in order to obtain the best accuracy in the presence of any domain gap, with little or negligible additional computational cost.
APA, Harvard, Vancouver, ISO, and other styles
21

Shu-Wei Liao and 廖書緯. "A Local Information Based Synthetic Minority Oversampling Technique for Imbalanced Dataset Learning." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/5mdht9.

Full text
Abstract:
Master's thesis, National Cheng Kung University, Department of Industrial and Information Management, ROC academic year 107 (2018/19)
A dataset is imbalanced if its classes are not approximately equally represented. Data mining on imbalanced datasets has received increasing attention in recent years. The class imbalance problem occurs when one class has very few samples compared to the others. SMOTE (the Synthetic Minority Over-sampling Technique) is an effective method for the imbalanced learning problem: one minority sample is taken as the seed sample, a nearby minority sample is found as the selected sample, and a virtual sample is then generated between the two. In this thesis we consider both the influence of the majority samples and the influence of the minority samples on the selected sample, and develop a new sample-generating procedure based on local majority-class and local minority-class information. Four datasets taken from the UCI Machine Learning Repository are used in the experiments. We compare the proposed method with SMOTE and its extensions, including Borderline-SMOTE1 (B1-SMOTE), Safe-Level SMOTE (SL-SMOTE), Local-Neighborhood SMOTE (LN-SMOTE), and ADASYN. The results show that, when the datasets are examined with C4.5 decision trees, the proposed method achieves better classifier performance for the minority class than the other methods.
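How local information can steer the interpolation is sketched below under loudly stated assumptions: this is one plausible reading in the Safe-Level-SMOTE family, scoring each endpoint by the minority fraction of its neighborhood, and is not the thesis's actual weighting scheme.

```python
import numpy as np

def local_safety(x, X_all, y_all, k=5):
    """Fraction of x's k nearest neighbors (over the whole dataset) that
    are minority; assumes x is itself a row of X_all (nearest hit skipped)."""
    d = np.linalg.norm(X_all - x, axis=1)
    nn = np.argsort(d)[1:k + 1]
    return np.mean(y_all[nn] == 1)                 # label 1 = minority class

def guided_interpolation(x_seed, x_sel, X_all, y_all, rng, k=5):
    """Bias the virtual sample toward the endpoint whose neighborhood is
    more minority-dominated, shrinking steps into majority territory."""
    s_seed = local_safety(x_seed, X_all, y_all, k)
    s_sel = local_safety(x_sel, X_all, y_all, k)
    hi = s_sel / (s_seed + s_sel + 1e-12)          # upper bound in [0, 1]
    return x_seed + rng.uniform(0.0, max(hi, 1e-3)) * (x_sel - x_seed)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.array([1] * 10 + [0] * 20)                  # 10 minority, 20 majority
virtual = guided_interpolation(X[0], X[1], X, y, rng)
```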
APA, Harvard, Vancouver, ISO, and other styles
22

Foroozandeh, Mehdi. "GAN-Based Synthesis of Brain Tumor Segmentation Data : Augmenting a dataset by generating artificial images." Thesis, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-169863.

Full text
Abstract:
Machine learning applications within medical imaging often suffer from a lack of data, as a consequence of restrictions that hinder the free distribution of patient information. In this project, GANs (generative adversarial networks) are used to generate data synthetically, in an effort to circumvent this issue. The GAN framework PGAN is trained on the brain tumor segmentation dataset BraTS to generate new, synthetic brain tumor masks with the same visual characteristics as the real samples. The image-to-image translation network SPADE is subsequently trained on the image pairs in the real dataset to learn a transformation from segmentation masks to brain MR images, and is in turn used to map the artificial segmentation masks generated by PGAN to corresponding artificial MR images. The images generated by these networks form a new, synthetic dataset, which is used to augment the original dataset. Different quantities of real and synthetic data are then evaluated in three different brain tumor segmentation tasks, where the image segmentation network U-Net is trained on this data to segment (real) MR images into the classes in question. The final segmentation performance of each training instance is evaluated on test data from the real dataset with the weighted Dice loss metric. The results indicate a slight increase in performance across all segmentation tasks evaluated in this project when some quantity of synthetic images is included. However, the differences were largest when the experiments were restricted to using only 20% of the real data, and less significant when the full dataset was made available. A majority of the generated segmentation masks appear visually convincing to an extent (although somewhat noisy with regard to the intra-tumoral classes), while a relatively large proportion appear heavily noisy and corrupted. The translation of segmentation masks to MR images via SPADE, however, proved more reliable and consistent.
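The weighted Dice loss used for evaluation has a standard soft form, sketched below under the assumption of per-class probability maps and one-hot targets; the exact class weights used in the thesis are not specified here.

```python
import numpy as np

def weighted_dice_loss(pred, target, weights, eps=1e-6):
    """Weighted soft Dice loss over C classes; pred and target have shape
    (C, H, W), pred holding class probabilities and target one-hot masks."""
    inter = (pred * target).sum(axis=(1, 2))
    denom = pred.sum(axis=(1, 2)) + target.sum(axis=(1, 2))
    dice = (2.0 * inter + eps) / (denom + eps)     # per-class soft Dice
    w = np.asarray(weights, dtype=float)
    return float(1.0 - (w * dice).sum() / w.sum())

rng = np.random.default_rng(0)
pred = rng.dirichlet(np.ones(3), size=(4, 4)).transpose(2, 0, 1)   # (C, H, W)
target = np.eye(3)[rng.integers(0, 3, size=(4, 4))].transpose(2, 0, 1)
print(weighted_dice_loss(pred, target, weights=[0.2, 0.4, 0.4]))
```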
APA, Harvard, Vancouver, ISO, and other styles
23

Dale, Ashley S. "3D Object Detection Using Virtual Environment Assisted Deep Network Training." Thesis, 2020. http://hdl.handle.net/1805/24756.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
An RGBZ synthetic dataset consisting of five object classes in a variety of virtual environments and orientations was combined with a small sample of real-world image data and used to train the Mask R-CNN (MR-CNN) architecture in a variety of configurations. When the MR-CNN architecture was initialized with MS COCO weights and the heads were trained with a mix of synthetic and real-world data, F1 scores improved in four of the five classes: the average maximum F1-score over all classes and all epochs for the networks trained with synthetic data is F1* = 0.91, compared to F1 = 0.89 for the networks trained exclusively with real data, and the standard deviation of the maximum mean F1-score for synthetically trained networks is σ*_F1 = 0.015, compared to σ_F1 = 0.020 for the networks trained exclusively with real data. Various backgrounds in synthetic data were shown to have negligible impact on F1 scores, opening the door to abstract backgrounds and minimizing the need for intensive synthetic data fabrication. When the MR-CNN architecture was initialized with MS COCO weights and depth data was included in the training data, the network was shown to rely heavily on the initial convolutional input to feed features into the network; the image depth channel was shown to influence mask generation, and the image color channels were shown to influence object classification. A set of latent variables for a subset of the synthetic dataset was generated with a Variational Autoencoder, then analyzed using Principal Component Analysis and Uniform Manifold Approximation and Projection (UMAP). The UMAP analysis showed no meaningful distinction between real-world and synthetic data, and a small bias towards clustering based on image background.
APA, Harvard, Vancouver, ISO, and other styles
