Journal articles on the topic 'Synthetic datasets'


1

Hanel, A., D. Kreuzpaintner, and U. Stilla. "EVALUATION OF A TRAFFIC SIGN DETECTOR BY SYNTHETIC IMAGE DATA FOR ADVANCED DRIVER ASSISTANCE SYSTEMS." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLII-2 (May 30, 2018): 425–32. http://dx.doi.org/10.5194/isprs-archives-xlii-2-425-2018.

Abstract:
Recently, several synthetic image datasets of street scenes have been published. These datasets contain various traffic signs and can therefore be used to train and test machine learning-based traffic sign detectors. In this contribution, selected datasets are compared regarding their applicability for traffic sign detection. The comparison covers the process used to produce the synthetic images, the virtual worlds needed to produce them, and their environmental conditions, as well as variations in the appearance of traffic signs and the labeling strategies used for the datasets. To evaluate the synthetic SYNTHIA dataset, a deep learning traffic sign detector is trained on multiple training datasets with different ratios of synthetic to real training samples. A test of the detector on real samples only has shown that an overall accuracy and ROC AUC of more than 95% can be achieved for both small and large proportions of synthetic samples in the training dataset.
2

Arvanitis, Theodoros N., Sean White, Stuart Harrison, Rupert Chaplin, and George Despotou. "A method for machine learning generation of realistic synthetic datasets for validating healthcare applications." Health Informatics Journal 28, no. 2 (January 2022): 146045822210770. http://dx.doi.org/10.1177/14604582221077000.

Abstract:
Digital health applications can improve the quality and effectiveness of healthcare by offering users a number of new tools, which are often considered medical devices. Assuring their safe operation requires, amongst other things, clinical validation, which needs large datasets to test them in realistic clinical scenarios. Access to such datasets is challenging due to patient privacy concerns, and the development of synthetic datasets is seen as a potential alternative. The objective of this paper is to develop a method for generating realistic synthetic datasets that are statistically equivalent to real clinical datasets, and to demonstrate that a Generative Adversarial Network (GAN) based approach is fit for this purpose. A generative adversarial network was implemented and trained, in a series of six experiments, using numerical and categorical variables, including ICD-9 and laboratory codes, from three clinically relevant datasets. A number of contextual steps provided the success criteria for the synthetic dataset. A synthetic dataset was generated that exhibits very similar statistical characteristics to the real dataset: pairwise associations of variables are very similar, and a high degree of Jaccard similarity and a successful Kolmogorov-Smirnov (K-S) test further support this. The proof of concept of generating realistic synthetic datasets was successful, and the approach shows promise for further work.
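The similarity checks named in this abstract, Jaccard similarity between code sets and a two-sample Kolmogorov-Smirnov test between variable distributions, are easy to compute from scratch. A purely illustrative sketch (not the authors' code; all names and example values are invented):

```python
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    cdf = lambda xs, v: sum(x <= v for x in xs) / len(xs)
    return max(abs(cdf(a, v) - cdf(b, v)) for v in points)

def jaccard(codes_a, codes_b):
    """Jaccard similarity between two sets of codes (e.g. ICD-9)."""
    sa, sb = set(codes_a), set(codes_b)
    return len(sa & sb) / len(sa | sb)

rng = random.Random(0)
real = [rng.gauss(50, 10) for _ in range(500)]        # stand-in "real" variable
synthetic = [rng.gauss(50, 10) for _ in range(500)]   # stand-in "synthetic" variable
print("K-S statistic:", ks_statistic(real, synthetic))   # near 0 -> similar
print("Jaccard:", jaccard({"250.0", "401.9"}, {"250.0", "427.31"}))
```

A small K-S statistic (and a high Jaccard value for categorical codes) supports the claim that the synthetic data mirrors the real data.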
3

Kannan, Subarmaniam. "Synthetic time series data generation for edge analytics." F1000Research 11 (January 20, 2022): 67. http://dx.doi.org/10.12688/f1000research.72984.1.

Abstract:
Background: Internet of Things (IoT) edge analytics makes data computation and storage available adjacent to the source of data generation in an IoT system. This improves sensor data handling and speeds up analysis, prediction, and action. Using machine learning for analytics and task offloading in edge servers could minimise latency and energy usage. However, one of the key challenges in applying machine learning to edge analytics is finding a real-world dataset with which to build a representative predictive model, and this challenge has undeniably slowed the adoption of machine learning methods in IoT edge analytics. The generation of realistic synthetic datasets can therefore help speed up the methodological use of machine learning in edge analytics. Methods: We create synthetic data with features similar to data from IoT devices, using an existing air quality dataset that includes temperature and gas sensor measurements. This real-time dataset includes component values for the Air Quality Index (AQI) and ppm concentrations for various polluting gases. We build a JavaScript Object Notation (JSON) model that captures the distribution of variables and the structure of the real dataset in order to generate the synthetic data, and we build comparative predictive models on the synthetic and original datasets. Results: Analysis of the predictive model built on the synthetic dataset shows that it can successfully be used for edge analytics, replacing real-world datasets: there is no significant difference between models built on the real-world and synthetic datasets, and the generated synthetic data require no modification to suit edge computing requirements. Conclusions: The framework can generate representative synthetic datasets based on JSON schema attributes. The accuracy, precision, and recall values for the real and synthetic datasets indicate that the logistic regression model successfully classifies the data.
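The JSON-model idea described here, capture each variable's distribution in a schema, then sample from it, can be sketched in a few lines. The schema below is hypothetical (field names and distribution parameters are invented, not taken from the paper's air quality dataset):

```python
import json
import random

# Hypothetical schema capturing per-variable distributions of a real dataset.
schema = json.loads("""
{
  "temperature": {"type": "normal", "mean": 25.3, "std": 4.1},
  "co_ppm":      {"type": "normal", "mean": 1.2,  "std": 0.4},
  "aqi_class":   {"type": "categorical",
                  "values": ["good", "moderate", "unhealthy"],
                  "probs":  [0.6, 0.3, 0.1]}
}
""")

def synth_row(schema, rng):
    """Draw one synthetic record by sampling each field from its schema entry."""
    row = {}
    for name, spec in schema.items():
        if spec["type"] == "normal":
            row[name] = rng.gauss(spec["mean"], spec["std"])
        elif spec["type"] == "categorical":
            row[name] = rng.choices(spec["values"], weights=spec["probs"])[0]
    return row

rng = random.Random(42)
dataset = [synth_row(schema, rng) for _ in range(5)]
print(dataset[0])
```

A real generator would also need to preserve correlations between variables, which independent per-field sampling (as above) deliberately ignores for brevity.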
4

Poudevigne-Durance, Thomas, Owen Dafydd Jones, and Yipeng Qin. "MaWGAN: A Generative Adversarial Network to Create Synthetic Data from Datasets with Missing Data." Electronics 11, no. 6 (March 8, 2022): 837. http://dx.doi.org/10.3390/electronics11060837.

Abstract:
The creation of synthetic data is important for a range of applications, for example, to anonymise sensitive datasets or to increase the volume of data in a dataset. When the target dataset has missing data, it is common to simply discard incomplete observations, even though this necessarily means some loss of information. However, when the proportion of missing data is large, discarding incomplete observations may not leave enough data to accurately estimate the joint distribution. Thus, there is a need for data synthesis methods capable of using datasets with missing data, to improve accuracy and, in more extreme cases, to make data synthesis possible at all. To achieve this, we propose a novel generative adversarial network (GAN) called MaWGAN (for masked Wasserstein GAN), which creates synthetic data directly from datasets with missing values. As with existing GAN approaches, the MaWGAN synthetic data generator generates samples from the full joint distribution. We introduce a novel methodology for comparing the generator output with the original data that does not require us to discard incomplete observations: it is based on a modification of the Wasserstein distance and is easily implemented using masks generated from the pattern of missing data in the original dataset. Numerical experiments demonstrate the superior performance of MaWGAN compared to (a) discarding incomplete observations before using a GAN, and (b) imputing missing values (using the GAIN algorithm) before using a GAN.
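The central trick, comparing generator output to the original data only at observed positions via a missingness mask, can be illustrated with a much simpler masked metric. The sketch below uses a masked RMSE rather than the paper's masked Wasserstein distance, purely to show how the mask avoids discarding incomplete rows:

```python
import math

def mask_of(data):
    """1 where a value is observed, 0 where missing (None),
    mirroring the dataset's pattern of missingness."""
    return [[0 if v is None else 1 for v in row] for row in data]

def masked_rmse(real, fake):
    """Compare synthetic output to real data only at observed positions,
    instead of discarding incomplete observations."""
    m = mask_of(real)
    diffs = [(r - f) ** 2
             for row_r, row_f, row_m in zip(real, fake, m)
             for r, f, keep in zip(row_r, row_f, row_m) if keep]
    return math.sqrt(sum(diffs) / len(diffs))

real = [[1.0, None, 3.0],
        [4.0, 5.0, None]]       # incomplete real observations
fake = [[1.5, 2.0, 3.0],
        [4.0, 4.0, 9.9]]        # generator output (always complete)
print(masked_rmse(real, fake))  # missing cells contribute nothing
```

Note that both incomplete rows still contribute their observed cells to the score; a discard-incomplete strategy would have thrown both rows away.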
5

So, Banghee, Jean-Philippe Boucher, and Emiliano A. Valdez. "Synthetic Dataset Generation of Driver Telematics." Risks 9, no. 4 (March 24, 2021): 58. http://dx.doi.org/10.3390/risks9040058.

Abstract:
This article describes the techniques employed in the production of a synthetic dataset of driver telematics, emulated from a similar real insurance dataset. The generated synthetic dataset has 100,000 policies that include observations of drivers' claims experience together with associated classical risk variables and telematics-related variables. This work aims to produce a resource that can be used to advance models for assessing risks in usage-based insurance. It follows a three-stage process using machine learning algorithms. In the first stage, a synthetic portfolio of the space of feature variables is generated by applying an extended SMOTE algorithm. The second stage simulates values for the number of claims as multiple binary classifications using feedforward neural networks. The third stage simulates values for the aggregated amount of claims as a regression using feedforward neural networks, with the number of claims included in the set of feature variables. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualizations and data summaries produce remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work to be valuable.
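The first stage applies an extended SMOTE algorithm to synthesize the feature space. A plain (non-extended) SMOTE-style interpolation, sketched from scratch for illustration only, with an invented two-feature toy portfolio:

```python
import random

def smote_like(samples, n_new, k=3, rng=None):
    """Generate synthetic feature vectors by interpolating between a sample
    and one of its k nearest neighbours (the core idea behind SMOTE)."""
    rng = rng or random.Random(0)
    out = []
    for _ in range(n_new):
        x = rng.choice(samples)
        dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
        neighbours = sorted((s for s in samples if s is not x),
                            key=lambda s: dist(x, s))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()                      # interpolation weight in [0, 1)
        out.append(tuple(u + lam * (v - u) for u, v in zip(x, nb)))
    return out

# Toy portfolio: (claim frequency, annual mileage) pairs, invented values.
portfolio = [(0.1, 1200.0), (0.2, 1100.0), (0.15, 1300.0), (0.3, 900.0)]
print(smote_like(portfolio, n_new=2))
```

Each synthetic point lies on a segment between two real points, so it stays inside the convex hull of the data; the paper's extended variant goes beyond this basic scheme.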
6

Wu, Hao, Yue Ning, Prithwish Chakraborty, Jilles Vreeken, Nikolaj Tatti, and Naren Ramakrishnan. "Generating Realistic Synthetic Population Datasets." ACM Transactions on Knowledge Discovery from Data 12, no. 4 (July 13, 2018): 1–22. http://dx.doi.org/10.1145/3182383.

7

Minhas, Saad, Zeba Khanam, Shoaib Ehsan, Klaus McDonald-Maier, and Aura Hernández-Sabaté. "Weather Classification by Utilizing Synthetic Data." Sensors 22, no. 9 (April 21, 2022): 3193. http://dx.doi.org/10.3390/s22093193.

Abstract:
Weather prediction from real-world images is a complex task when targeting classification using neural networks. Moreover, the images in available datasets can contain a huge amount of variance across locations and the weather conditions those images represent. In this article, the capabilities of a custom-built driving simulator are explored, specifically to simulate a wide range of weather conditions, and the performance of a new synthetic dataset generated by this simulator is assessed. The results indicate that using synthetic datasets in conjunction with real-world datasets can increase the training efficiency of CNNs by as much as 74%. The article paves a way forward to tackle the persistent problem of bias in vision-based datasets.
8

Zhang, Jie, Xinyan Qin, Jin Lei, Bo Jia, Bo Li, Zhaojun Li, Huidong Li, Yujie Zeng, and Jie Song. "A Novel Auto-Synthesis Dataset Approach for Fitting Recognition Using Prior Series Data." Sensors 22, no. 12 (June 9, 2022): 4364. http://dx.doi.org/10.3390/s22124364.

Abstract:
Because power transmission lines (PTLs) traverse complex environments, collecting data on their fittings is difficult and costly. To address this, we propose a novel auto-synthesis dataset approach for fitting recognition using prior series data. The approach comprises three steps: (1) formulate synthesis rules from the prior series data; (2) render 2D images based on the synthesis rules using advanced virtual 3D techniques; (3) generate the synthetic dataset, with annotations obtained by processing the images using OpenCV. A model trained on the synthetic dataset was tested on a real dataset (including images and annotations), achieving a mean average precision (mAP) of 0.98 and verifying the feasibility and effectiveness of the proposed approach. The recognition accuracy is comparable with training on real samples, while the cost of generating synthetic datasets is greatly reduced. The proposed approach improves the efficiency of establishing a dataset, providing a training data basis for deep learning (DL) based fitting recognition.
9

Kugurakova, Vlada Vladimirovna, Vitaly Denisovich Abramov, Daniil Ivanovich Kostiuk, Regina Airatovna Sharaeva, Rim Radikovich Gazizova, and Murad Rustemovich Khafizov. "Generation of Three-Dimensional Synthetic Datasets." Russian Digital Libraries Journal 24, no. 4 (September 12, 2021): 622–52. http://dx.doi.org/10.26907/1562-5419-2021-24-4-622-652.

Abstract:
This work describes the development of a universal toolkit for generating synthetic data for training various neural networks. The approach has proven successful and effective in solving various problems, in particular, training a neural network to recognize shopping behavior inside stores through surveillance cameras and training a neural network to recognize spaces with augmented reality devices without using auxiliary infrared cameras. The general conclusions allow planning the further development of technologies for generating three-dimensional synthetic data.
10

Ma’sum, Muhammad Anwar. "Intelligent Clustering and Dynamic Incremental Learning to Generate Multi-Codebook Fuzzy Neural Network for Multi-Modal Data Classification." Symmetry 12, no. 4 (April 24, 2020): 679. http://dx.doi.org/10.3390/sym12040679.

Abstract:
Classification of multi-modal data is one of the challenges in the machine learning field. Multi-modal data need special treatment, as their features are distributed across several areas. This study proposes multi-codebook fuzzy neural networks that use intelligent clustering and dynamic incremental learning for multi-modal data classification. We utilized intelligent K-means clustering based on anomalous patterns and intelligent K-means clustering based on histogram information. Clustering is used to generate codebook candidates before the training process, while incremental learning is utilized when the condition to generate a new codebook is met; this condition is based on the similarity of the winner class to other classes. The proposed method was evaluated on synthetic and benchmark datasets. The experimental results showed that the proposed multi-codebook fuzzy neural networks using dynamic incremental learning achieve significant improvements over the original fuzzy neural networks: 15.65%, 5.31%, and 11.42% on the synthetic dataset, the benchmark dataset, and the average of all datasets, respectively, for incremental version 1, and 21.08%, 4.63%, and 14.35%, respectively, for incremental version 2. The multi-codebook fuzzy neural networks using intelligent clustering also achieved significant improvements over the original fuzzy neural networks: 23.90%, 2.10%, and 15.02% on the synthetic dataset, the benchmark dataset, and the average of all datasets, respectively.
11

Zhang, Runfei, Peiqi Yang, Shouyang Liu, Caihong Wang, and Jing Liu. "Evaluation of the Methods for Estimating Leaf Chlorophyll Content with SPAD Chlorophyll Meters." Remote Sensing 14, no. 20 (October 14, 2022): 5144. http://dx.doi.org/10.3390/rs14205144.

Abstract:
Leaf chlorophyll content (LCC) is an indicator of leaf photosynthetic capacity and is crucial for improving the understanding of plant physiological status. SPAD meters are routinely used to provide an instantaneous estimation of in situ LCC. However, the calibration of meter readings into absolute measures of LCC is difficult, and a generic approach for this conversion remains elusive. This study presents an evaluation of the approaches that are commonly used to convert SPAD readings into absolute LCC values. We compared these approaches using three field datasets and one synthetic dataset. The field datasets consist of LCC measured with a destructive method in the laboratory, together with SPAD readings measured in the field for various vegetation types. The synthetic dataset was generated with the leaf radiative transfer model PROSPECT-5 across different leaf structures. LCC covers a wide range, from 1.40 μg cm⁻² to 86.34 μg cm⁻² in the field datasets and from 5 μg cm⁻² to 80 μg cm⁻² in the synthetic dataset. The relationships between LCC and SPAD readings were examined using linear, polynomial, exponential, and homographic functions for the field and synthetic datasets. For the field datasets, the assessments were conducted for (i) all three datasets together, (ii) individual datasets, and (iii) individual vegetation species. For the synthetic dataset, leaves with different leaf structures (which mimic different vegetation species) were grouped for the evaluation. The results demonstrate that the linear function is the most accurate one for the simulated dataset, in which leaf structure is relatively simple due to the turbid-medium assumption of the PROSPECT-5 model; this assumption complies with the assumption made in the design of the SPAD meter's algorithm. As a result, a linear relationship between LCC and SPAD values was found for the modeled dataset, in which the leaf structure is simple. For the field datasets, the functions do not perform well for all datasets together, but improve significantly for individual datasets or species. The overall performance of the linear (LCC = a·SPAD + b), polynomial (LCC = a·SPAD² + b·SPAD + c), and exponential (LCC = 0.0893·10^(SPAD^α)) functions is promising for various datasets and species, with R² > 0.8 and RMSE < 10 μg cm⁻². However, the accuracy of the homographic function (LCC = a·SPAD/(b − SPAD)) changes significantly among datasets and species, with R² ranging from 0.02 for wheat to 0.92 for linseed (RMSE from 642.50 μg cm⁻² to 5.74 μg cm⁻²). Besides being species- and dataset-dependent, the homographic function is also more likely to produce a numerical singularity due to the characteristics of the function itself. Compared with the linear and exponential functions, the polynomial function has a higher degree of freedom due to one extra fitting parameter, so for smaller datasets the linear and exponential functions are more suitable. This study compares different approaches and addresses the uncertainty in the conversion from SPAD readings into absolute LCC, which facilitates more accurate measurements of absolute LCC in the field.
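Fitting the simplest of these calibration functions, the linear form LCC = a·SPAD + b, reduces to ordinary least squares. A self-contained sketch with invented calibration points (not the study's data):

```python
def fit_linear(spad, lcc):
    """Ordinary least squares for LCC = a*SPAD + b."""
    n = len(spad)
    mx = sum(spad) / n
    my = sum(lcc) / n
    a = sum((x - mx) * (y - my) for x, y in zip(spad, lcc)) / \
        sum((x - mx) ** 2 for x in spad)
    b = my - a * mx
    return a, b

# Hypothetical calibration pairs: (SPAD reading, lab-measured LCC in ug/cm^2).
spad = [10.0, 20.0, 30.0, 40.0, 50.0]
lcc = [8.1, 17.9, 28.3, 37.6, 48.0]
a, b = fit_linear(spad, lcc)
print(f"LCC ~= {a:.3f} * SPAD + {b:.3f}")
```

The nonlinear forms (exponential and homographic) need iterative fitting, e.g. with a nonlinear least-squares routine, which is where the homographic function's singularity at SPAD = b can cause trouble.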
12

Sa, Inkyu, Jongyoon Lim, Hoseok Ahn, and Bruce MacDonald. "deepNIR: Datasets for Generating Synthetic NIR Images and Improved Fruit Detection System Using Deep Learning Techniques." Sensors 22, no. 13 (June 22, 2022): 4721. http://dx.doi.org/10.3390/s22134721.

Abstract:
This paper presents datasets utilised for synthetic near-infrared (NIR) image generation and bounding-box-level fruit detection systems. A high-quality dataset is one of the essential building blocks for success in model generalisation and the deployment of data-driven deep neural networks, and synthetic data generation tasks often require more training samples than other supervised approaches. We therefore share NIR+RGB datasets that are re-processed from two public datasets (nirscene and SEN12MS) and expanded from our previous study, deepFruits, together with our novel NIR+RGB sweet pepper (capsicum) dataset. We oversampled the original nirscene dataset at 10, 100, 200, and 400 ratios, yielding a total of 127k pairs of images. From the SEN12MS satellite multispectral dataset, we selected the Summer (45k) and All seasons (180k) subsets and applied a simple yet important conversion: digital number (DN) to pixel value conversion followed by image standardisation. Our sweet pepper dataset consists of 1615 pairs of NIR+RGB images collected from commercial farms. We demonstrate quantitatively and qualitatively that these NIR+RGB datasets are sufficient for synthetic NIR image generation, achieving Fréchet inception distances (FIDs) of 11.36, 26.53, and 40.15 for the nirscene1, SEN12MS, and sweet pepper datasets, respectively. In addition, we release manual annotations for 11 fruit bounding-box datasets that can be exported in various formats using a cloud service: four newly added fruits (blueberry, cherry, kiwi, and wheat) on top of the seven from our previous deepFruits work (apple, avocado, capsicum, mango, orange, rockmelon, and strawberry). The dataset contains 162k bounding-box instances in total and is ready to use from a cloud service. For evaluation of the dataset, the Yolov5 single-stage detector is used, reporting impressive mean average precision, mAP[0.5:0.95], results between 0.49 and 0.812. We hope these datasets are useful and serve as a baseline for future studies.
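The "image standardisation" step mentioned for the SEN12MS subsets is commonly implemented as per-image zero-mean, unit-variance scaling; the paper's exact preprocessing may differ, so treat this as an assumed convention:

```python
def standardise(image):
    """Per-image standardisation: rescale pixel values to zero mean
    and unit variance (a common preprocessing convention)."""
    flat = [p for row in image for p in row]
    n = len(flat)
    mean = sum(flat) / n
    var = sum((p - mean) ** 2 for p in flat) / n
    std = var ** 0.5 or 1.0   # guard against constant images
    return [[(p - mean) / std for p in row] for row in image]

tile = [[0.0, 2.0],
        [0.0, 2.0]]           # tiny invented 2x2 "image" of DN values
print(standardise(tile))      # -> [[-1.0, 1.0], [-1.0, 1.0]]
```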
13

Albuquerque, G., T. Lowe, and M. Magnor. "Synthetic Generation of High-Dimensional Datasets." IEEE Transactions on Visualization and Computer Graphics 17, no. 12 (December 2011): 2317–24. http://dx.doi.org/10.1109/tvcg.2011.237.

14

He, Boyong, Xianjiang Li, Bo Huang, Enhui Gu, Weijie Guo, and Liaoni Wu. "UnityShip: A Large-Scale Synthetic Dataset for Ship Recognition in Aerial Images." Remote Sensing 13, no. 24 (December 9, 2021): 4999. http://dx.doi.org/10.3390/rs13244999.

Abstract:
As a data-driven approach, deep learning requires a large amount of annotated data for training to obtain a sufficiently accurate and generalized model, especially in the field of computer vision. However, when compared with generic object recognition datasets, aerial image datasets are more challenging to acquire and more expensive to label. Obtaining a large amount of high-quality aerial image data for object recognition and image understanding is an urgent problem. Existing studies show that synthetic data can effectively reduce the amount of training data required. Therefore, in this paper, we propose the first synthetic aerial image dataset for ship recognition, called UnityShip. This dataset contains over 100,000 synthetic images and 194,054 ship instances, including 79 different ship models in ten categories and six different large virtual scenes with different time periods, weather environments, and altitudes. The annotations include environmental information, instance-level horizontal bounding boxes, oriented bounding boxes, and the type and ID of each ship. This provides the basis for object detection, oriented object detection, fine-grained recognition, and scene recognition. To investigate the applications of UnityShip, the synthetic data were validated for model pre-training and data augmentation using three different object detection algorithms and six existing real-world ship detection datasets. Our experimental results show that for small-sized and medium-sized real-world datasets, the synthetic data achieve an improvement in model pre-training and data augmentation, showing the value and potential of synthetic data in aerial image recognition and understanding tasks.
15

Priswanto, Budi, and Handri Santoso. "CycleGAN and SRGAN to Enrich the Dataset." SinkrOn 7, no. 2 (April 18, 2022): 495–503. http://dx.doi.org/10.33395/sinkron.v7i2.11384.

Abstract:
Developments in the field of computer science are growing rapidly; for example, image and video prediction has been widely applied across various fields to assist downstream processes, and the field of computer vision has generated many ideas for processing with deep learning algorithms. A recurring problem when using deep learning or machine learning is the limited availability, or complete unavailability, of datasets. Various methods are used to enrich a dataset; one is to extend an image dataset by creating synthetic images, and one well-known family of algorithms for generating synthetic images is Generative Adversarial Networks (GANs), of which there are currently around 500 variants. This research utilizes the CycleGAN architecture to enrich a dataset by using the GAN as a synthetic image generator. This is very important for building image datasets for training and testing deep learning models such as Convolutional Neural Networks. In addition, the use of synthetic images helps a deep learning model avoid overfitting, one cause of which is a lack of data. There are many ways to augment image datasets, such as cropping or rotating by 90 or 180 degrees. CycleGAN was chosen because it is neither as complicated as some other GANs nor overly simple. The CycleGAN synthetic images are then processed with a Super-Resolution GAN (SRGAN), which aims to improve image quality, producing images that are both varied and of good quality.
17

CHELIOTIS, Kostas. "Using synthetic data for the dissemination of computational geospatial models." European Journal of Geography 11, no. 3 (December 13, 2020): 76–91. http://dx.doi.org/10.48088/ejg.k.che.11.3.76.91.

Abstract:
Detailed datasets of real-world systems are becoming more and more available, accompanied by a similar increased use in research. However, datasets are often provided to researchers with restrictions regarding their publication. This poses a major limitation for the dissemination of computational tools, whose comprehension often requires the availability of the detailed dataset around which the tool was built. This paper discusses the potential of synthetic datasets for circumventing such limitations, as it is often the data content itself that is proprietary, rather than the dataset schema. Therefore, new data can be generated that conform to the schema, and may then be distributed freely alongside the relevant models, allowing other researchers to explore tools in action to their full extent. This paper presents the process of creating synthetic geospatial data within the scope of a research project which relied on real-world data, originally captured through close collaboration with industry partners.
18

Cheliotis, Kostas. "Using synthetic data for the dissemination of computational geospatial models." European Journal of Geography 11, no. 4 (December 16, 2020): 6–21. http://dx.doi.org/10.48088/ejg.k.che.11.4.06.21.

Abstract:
Detailed datasets of real-world systems are becoming more and more available, accompanied by a similar increased use in research. However, datasets are often provided to researchers with restrictions regarding their publication. This poses a major limitation for the dissemination of computational tools, whose comprehension often requires the availability of the detailed dataset around which the tool was built. This paper discusses the potential of synthetic datasets for circumventing such limitations, as it is often the data content itself that is proprietary, rather than the dataset schema. Therefore, new data can be generated that conform to the schema, and may then be distributed freely alongside the relevant models, allowing other researchers to explore tools in action to their full extent. This paper presents the process of creating synthetic geospatial data within the scope of a research project which relied on real-world data, originally captured through close collaboration with industry partners.
19

Traynor, Carlos, Tarjinder Sahota, Helen Tomkinson, Ignacio Gonzalez-Garcia, Neil Evans, and Michael Chappell. "Imputing Biomarker Status from RWE Datasets—A Comparative Study." Journal of Personalized Medicine 11, no. 12 (December 13, 2021): 1356. http://dx.doi.org/10.3390/jpm11121356.

Abstract:
Missing data is a universal problem in analysing Real-World Evidence (RWE) datasets. In RWE datasets, there is a need to understand which features best correlate with clinical outcomes. In this context, the missing status of several biomarkers may appear as gaps in the dataset that hide meaningful values for analysis. Imputation methods are general strategies that replace missing values with plausible values. Using the Flatiron NSCLC dataset, including more than 35,000 subjects, we compare the imputation performance of six such methods on missing data: predictive mean matching, expectation-maximisation, factorial analysis, random forest, generative adversarial networks and multivariate imputations with tabular networks. We also conduct extensive synthetic data experiments with structural causal models. Statistical learning from incomplete datasets should select an appropriate imputation algorithm accounting for the nature of missingness, the impact of missing data, and the distribution shift induced by the imputation algorithm. For our synthetic data experiments, tabular networks had the best overall performance. Methods using neural networks are promising for complex datasets with non-linearities. However, conventional methods such as predictive mean matching work well for the Flatiron NSCLC biomarker dataset.
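Of the compared methods, predictive mean matching is the simplest to sketch: fit a model on the observed cases, then impute each missing value by borrowing the observed outcome of a donor whose predicted mean is closest. A from-scratch univariate illustration with invented data (real implementations, e.g. MICE-style packages, handle many variables iteratively):

```python
import random

def pmm_impute(x_obs, y_obs, x_mis, k=3, rng=None):
    """Predictive mean matching for one target y given a predictor x:
    fit a linear model on observed pairs, then for each missing case
    borrow the observed y of one of the k donors whose predicted
    mean is closest to the case's own prediction."""
    rng = rng or random.Random(0)
    n = len(x_obs)
    mx, my = sum(x_obs) / n, sum(y_obs) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(x_obs, y_obs)) / \
            sum((x - mx) ** 2 for x in x_obs)
    intercept = my - slope * mx
    pred = lambda x: intercept + slope * x
    imputed = []
    for x in x_mis:
        donors = sorted(zip(x_obs, y_obs),
                        key=lambda p: abs(pred(p[0]) - pred(x)))[:k]
        imputed.append(rng.choice(donors)[1])   # borrow a real observed value
    return imputed

# Invented example: biomarker y observed for four subjects, missing for one.
print(pmm_impute([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [2.1]))
```

Because the imputed value is always a genuinely observed one, PMM avoids producing implausible values outside the data's support, which is one reason it performed well on the Flatiron biomarker dataset.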
APA, Harvard, Vancouver, ISO, and other styles
20

Gordon, Ben, Clara Fennessy, Susheel Varma, Jake Barrett, Enez McCondochie, Trevor Heritage, Oenone Duroe, et al. "Evaluation of freely available data profiling tools for health data research application: a functional evaluation review." BMJ Open 12, no. 5 (May 2022): e054186. http://dx.doi.org/10.1136/bmjopen-2021-054186.

Full text
Abstract:
Objectives: To objectively evaluate freely available data profiling software tools using healthcare data. Design: Data profiling tools were evaluated for their capabilities using publicly available information and data sheets. From initial assessment, several underwent further detailed evaluation for application on healthcare data using a synthetic dataset of 1000 patients and associated data using a common health data model, and tools were scored based on their functionality with this dataset. Setting: Improving the quality of healthcare data for research use is a priority. Profiling tools can assist by evaluating datasets across a range of quality dimensions. Several freely available software packages with profiling capabilities are available, but healthcare organisations often have limited data engineering capability and expertise. Participants: 28 profiling tools, 8 undergoing evaluation on a synthetic dataset of 1000 patients. Results: Of 28 potential profiling tools initially identified, 8 showed high potential for applicability with healthcare datasets based on available documentation, of which two performed consistently well for these purposes across multiple tasks including determination of completeness, consistency, uniqueness, validity, accuracy and provision of distribution metrics. Conclusions: Numerous freely available profiling tools are serviceable for potential use with health datasets, of which at least two demonstrated high performance across a range of technical data quality dimensions based on testing with the synthetic health dataset and common data model. The appropriate tool choice depends on factors including underlying organisational infrastructure and level of data engineering and coding expertise, but there are freely available tools that help profile health datasets for research use and inform curation activity.
APA, Harvard, Vancouver, ISO, and other styles
21

Burmakova, Anastasiya, and Diana Kalibatienė. "Applying Fuzzy Inference and Machine Learning Methods for Prediction with a Small Dataset: A Case Study for Predicting the Consequences of Oil Spills on a Ground Environment." Applied Sciences 12, no. 16 (August 18, 2022): 8252. http://dx.doi.org/10.3390/app12168252.

Full text
Abstract:
Applying machine learning (ML) and fuzzy inference systems (FIS) requires large datasets to obtain more accurate predictions. However, in the cases of oil spills on ground environments, only small datasets are available. Therefore, this research aims to assess the suitability of ML techniques and FIS for the prediction of the consequences of oil spills on ground environments using small datasets. Consequently, we present a hybrid approach for assessing the suitability of ML (Linear Regression, Decision Trees, Support Vector Regression, Ensembles, and Gaussian Process Regression) and the adaptive neural fuzzy inference system (ANFIS) for predicting the consequences of oil spills with a small dataset. This paper proposes enlarging the initial small dataset of an oil spill on a ground environment by using the synthetic data generated by applying a mathematical model. ML techniques and ANFIS were tested with the same generated synthetic datasets to assess the proposed approach. The proposed ANFIS-based approach shows significant performance and sufficient efficiency for predicting the consequences of oil spills on ground environments with a smaller dataset than the applied ML techniques. The main finding of this paper indicates that FIS is suitable for prediction with a small dataset and provides sufficiently accurate prediction results.
APA, Harvard, Vancouver, ISO, and other styles
22

Son, Guk-Jin, Dong-Hoon Kwak, Mi-Kyung Park, Young-Duk Kim, and Hee-Chul Jung. "U-Net-Based Foreign Object Detection Method Using Effective Image Acquisition System: A Case of Almond and Green Onion Flake Food Process." Sustainability 13, no. 24 (December 14, 2021): 13834. http://dx.doi.org/10.3390/su132413834.

Full text
Abstract:
Supervised deep learning-based foreign object detection algorithms are tedious, costly, and time-consuming because they usually require a large number of training datasets and annotations. These disadvantages make them frequently unsuitable for food quality evaluation and food manufacturing processes. However, the deep learning-based foreign object detection algorithm is an effective method to overcome the disadvantages of conventional foreign object detection methods mainly used in food inspection. For example, color sorter machines cannot detect foreign objects with a color similar to food, and their performance is easily degraded by changes in illuminance. Therefore, to detect foreign objects, we use a deep learning-based foreign object detection algorithm (model). In this paper, we present a synthetic method to efficiently acquire a training dataset for deep learning that can be used for food quality evaluation and food manufacturing processes. Moreover, we perform data augmentation using color jitter on a synthetic dataset and show that this approach significantly improves the illumination invariance of the model trained on synthetic datasets. The F1-score of the model trained on the synthetic dataset of almonds at 360 lux illumination intensity achieved a performance of 0.82, similar to the F1-score of the model trained on the real dataset. Moreover, the F1-score of the model trained with the real dataset combined with the synthetic dataset achieved better performance under changes in illumination than the model trained with the real dataset alone. In addition, compared with the traditional method of using color sorter machines to detect foreign objects, the model trained on the synthetic dataset has obvious advantages in accuracy and efficiency. These results indicate that the synthetic dataset not only competes with the real dataset but also complements it.
APA, Harvard, Vancouver, ISO, and other styles
23

Maack, Lennart, Lennart Holstein, and Alexander Schlaefer. "GANs for generation of synthetic ultrasound images from small datasets." Current Directions in Biomedical Engineering 8, no. 1 (July 1, 2022): 17–20. http://dx.doi.org/10.1515/cdbme-2022-0005.

Full text
Abstract:
The task of medical image classification is increasingly supported by algorithms. Deep learning methods like convolutional neural networks (CNNs) show superior performance in medical image analysis but need a high-quality training dataset with a large number of annotated samples. Particularly in the medical domain, the availability of such datasets is rare due to data privacy or the lack of data sharing practices among institutes. Generative adversarial networks (GANs) are able to generate high-quality synthetic images. This work investigates the capabilities of different state-of-the-art GAN architectures in generating realistic breast ultrasound images if only a small amount of training data is available. In a second step, these synthetic images are used to augment the real ultrasound image dataset utilized for training CNNs. The training of both GANs and CNNs is conducted with systematically reduced dataset sizes. The GAN architectures are capable of generating realistic ultrasound images. GANs using data augmentation techniques outperform the baseline StyleGAN2 with respect to the Fréchet Inception distance by up to 64.2%. CNN models trained with additional synthetic data outperform the baseline CNN model using only real data for training by up to 15.3% with respect to the F1 score, especially for datasets containing fewer than 100 images. As a conclusion, GANs can successfully be used to generate synthetic ultrasound images of high quality and diversity, improve the classification performance of CNNs and thus provide a benefit to computer-aided diagnostics.
APA, Harvard, Vancouver, ISO, and other styles
24

GÜMÜŞ, İbrahim Halil, and Serkan GÜLDAL. "Tıbbi Verilerde Heinz Ortalamasına Dayalı Yeni Sentetik Veriler Üreterek Veri Kümesini Dengeleme." Afyon Kocatepe University Journal of Sciences and Engineering 22, no. 3 (June 30, 2022): 570–76. http://dx.doi.org/10.35414/akufemubid.1011058.

Full text
Abstract:
Advances in science and technology have caused data sizes to increase at a great rate. Thus, unbalanced data has arisen. A dataset is unbalanced if the classes are not nearly equally represented. In this case, classifying the data causes performance values to decrease because the classification algorithms are developed on the assumption that the datasets are balanced. As the accuracy of the classification favors the majority class, the minority class is often misclassified. The majority of datasets, especially those used in the medical field, have an unbalanced distribution. To balance this distribution, several studies have been performed recently. These studies use undersampling and oversampling processes. In this study, a distance- and mean-based resampling method is used to produce synthetic samples from the minority class. For the resampling process, the closest neighbors for all data points belonging to the minority class were determined by using the Euclidean distance. Based on these neighbors and using the Heinz Mean, the desired number of new synthetic samples was formed between each sample to obtain balance. The Random Forest (RF) and Support Vector Machine (SVM) algorithms are used to classify the raw and balanced datasets, and the results were compared. Additionally, the other well-known methods (Random Over Sampling-ROS, Random Under Sampling-RUS, and Synthetic Minority Oversampling TEchnique-SMOTE) are compared with the proposed method. It was shown that the dataset balanced using the proposed resampling method increases classification efficiency as compared to the raw dataset and other methods. Accuracy measurements of RF are 0.751 and 0.799, and accuracy measurements of SVM are 0.762 and 0.781, for raw data and resampled data respectively. Likewise, there are improvements in the other metrics such as Precision, Recall, and F1 Score.
APA, Harvard, Vancouver, ISO, and other styles
25

Mukherjee, Sumit, Yixi Xu, Anusua Trivedi, Nabajyoti Patowary, and Juan L. Ferres. "privGAN: Protecting GANs from membership inference attacks at low cost to utility." Proceedings on Privacy Enhancing Technologies 2021, no. 3 (April 27, 2021): 142–63. http://dx.doi.org/10.2478/popets-2021-0041.

Full text
Abstract:
Generative Adversarial Networks (GANs) have made the release of synthetic images a viable approach to sharing data without releasing the original dataset. It has been shown that such synthetic data can be used for a variety of downstream tasks such as training classifiers that would otherwise require the original dataset to be shared. However, recent work has shown that the GAN models and their synthetically generated data can be used to infer the training set membership by an adversary who has access to the entire dataset and some auxiliary information. Current approaches to mitigate this problem (such as DPGAN [1]) lead to dramatically poorer generated sample quality than the original non-private GANs. Here we develop a new GAN architecture (privGAN), where the generator is trained not only to cheat the discriminator but also to defend against membership inference attacks. The new mechanism is shown to empirically provide protection against this mode of attack while leading to negligible loss in downstream performance. In addition, our algorithm has been shown to explicitly prevent memorization of the training set, which explains why our protection is so effective. The main contributions of this paper are: i) we propose a novel GAN architecture that can generate synthetic data in a privacy-preserving manner with minimal hyperparameter tuning and architecture selection, ii) we provide a theoretical understanding of the optimal solution of the privGAN loss function, iii) we empirically demonstrate the effectiveness of our model against several white- and black-box attacks on several benchmark datasets, iv) we empirically demonstrate on three common benchmark datasets that synthetic images generated by privGAN lead to negligible loss in downstream performance when compared against non-private GANs.
While we have focused on benchmarking privGAN exclusively on image datasets, the architecture of privGAN is not exclusive to image datasets and can be easily extended to other types of datasets. Repository link: https://github.com/microsoft/privGAN.
APA, Harvard, Vancouver, ISO, and other styles
26

Thambawita, Vajira, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L. Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, and Michael A. Riegler. "SinGAN-Seg: Synthetic training data generation for medical image segmentation." PLOS ONE 17, no. 5 (May 2, 2022): e0267976. http://dx.doi.org/10.1371/journal.pone.0267976.

Full text
Abstract:
Analyzing medical data to find abnormalities is a time-consuming and costly task, particularly for rare abnormalities, requiring tremendous efforts from medical experts. Therefore, artificial intelligence has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. However, the machine learning models used to build these tools are highly dependent on the data used to train them. Large amounts of data can be difficult to obtain in medicine due to privacy reasons, expensive and time-consuming annotations, and a general lack of data samples for infrequent lesions. In this study, we present a novel synthetic data generation pipeline, called SinGAN-Seg, to produce synthetic medical images with corresponding masks using a single training image. Our method is different from traditional generative adversarial networks (GANs) because our model needs only a single image and the corresponding ground truth to train. We also show that the synthetic data generation pipeline can be used to produce alternative artificial segmentation datasets with corresponding ground truth masks when real datasets cannot be shared. The pipeline is evaluated using qualitative and quantitative comparisons between real data and synthetic data to show that the style transfer technique used in our pipeline significantly improves the quality of the generated data and that our method is better than other state-of-the-art GANs at preparing synthetic images when the size of the training datasets is limited. By training UNet++ using both real data and the synthetic data generated from the SinGAN-Seg pipeline, we show that models trained on synthetic data have performance very close to that of models trained on real data when both datasets have a considerable amount of training data.
In contrast, we show that synthetic data generated from the SinGAN-Seg pipeline improves the performance of segmentation models when training datasets do not have a considerable amount of data. All experiments were performed using an open dataset and the code is publicly available on GitHub.
APA, Harvard, Vancouver, ISO, and other styles
27

Neuhausen, Marcel, Patrick Herbers, and Markus König. "Using Synthetic Data to Improve and Evaluate the Tracking Performance of Construction Workers on Site." Applied Sciences 10, no. 14 (July 18, 2020): 4948. http://dx.doi.org/10.3390/app10144948.

Full text
Abstract:
Vision-based tracking systems enable the optimization of productivity and safety management on construction sites by monitoring the workers' movements. However, training and evaluating such a system requires a vast amount of data. Sufficient datasets rarely exist for this purpose. We investigate the use of synthetic data to overcome this issue. Using 3D computer graphics software, we model virtual construction site scenarios. These are rendered for use as a synthetic dataset which augments a self-recorded real world dataset. Our approach is verified by means of a tracking system. For this, we train a YOLOv3 detector identifying pedestrian workers. Kalman filtering is applied to the detections to track them over consecutive video frames. First, the detector's performance is examined when using synthetic data of various environmental conditions for training. Second, we compare the evaluation results of our tracking system on real world and synthetic scenarios. With an increase of about 7.5 percentage points in mean average precision, our findings show that a synthetic extension is beneficial for otherwise small datasets. The similarity of synthetic and real world results allows for the conclusion that 3D scenes are an alternative for evaluating vision-based tracking systems on hazardous scenes without exposing workers to risks.
APA, Harvard, Vancouver, ISO, and other styles
28

Leng, Mingwei, Jianjun Cheng, Jinjin Wang, Zhengquan Zhang, Hanhai Zhou, and Xiaoyun Chen. "Active Semisupervised Clustering Algorithm with Label Propagation for Imbalanced and Multidensity Datasets." Mathematical Problems in Engineering 2013 (2013): 1–10. http://dx.doi.org/10.1155/2013/641927.

Full text
Abstract:
The accuracy of most existing semisupervised clustering algorithms based on a small labeled dataset is low when dealing with multidensity and imbalanced datasets, and labeling data is quite expensive and time-consuming in many real-world applications. This paper focuses on active data selection and semisupervised clustering in multidensity and imbalanced datasets and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active mechanism for data selection to minimize the amount of labeled data, and it utilizes multiple thresholds to expand the labeled datasets on multidensity and imbalanced datasets. Three standard datasets and one synthetic dataset are used to demonstrate the proposed algorithm, and the experimental results show that the proposed semisupervised clustering algorithm has higher accuracy and more stable performance in comparison to other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.
APA, Harvard, Vancouver, ISO, and other styles
29

El Emam, Khaled, Lucy Mosquera, and Jason Bass. "Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation." Journal of Medical Internet Research 22, no. 11 (November 16, 2020): e23139. http://dx.doi.org/10.2196/23139.

Full text
Abstract:
Background: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. Objective: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. Methods: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. Results: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. Conclusions: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
APA, Harvard, Vancouver, ISO, and other styles
30

Ivanovs, Maksims, Kaspars Ozols, Artis Dobrajs, and Roberts Kadikis. "Improving Semantic Segmentation of Urban Scenes for Self-Driving Cars with Synthetic Images." Sensors 22, no. 6 (March 14, 2022): 2252. http://dx.doi.org/10.3390/s22062252.

Full text
Abstract:
Semantic segmentation of an incoming visual stream from cameras is an essential part of the perception system of self-driving cars. State-of-the-art results in semantic segmentation have been achieved with deep neural networks (DNNs), yet training them requires large datasets, which are difficult and costly to acquire and time-consuming to label. A viable alternative to training DNNs solely on real-world datasets is to augment them with synthetic images, which can be easily modified and generated in large numbers. In the present study, we aim at improving the accuracy of semantic segmentation of urban scenes by augmenting the Cityscapes real-world dataset with synthetic images generated with the open-source driving simulator CARLA (Car Learning to Act). Augmentation with synthetic images with a low degree of photorealism from the MICC-SRI (Media Integration and Communication Center–Semantic Road Inpainting) dataset does not result in the improvement of the accuracy of semantic segmentation, yet both MobileNetV2 and Xception DNNs used in the present study demonstrate a better accuracy after training on the custom-made CCM (Cityscapes-CARLA Mixed) dataset, which contains both real-world Cityscapes images and high-resolution synthetic images generated with CARLA, than after training only on the real-world Cityscapes images. However, the accuracy of semantic segmentation does not improve proportionally to the amount of the synthetic data used for augmentation, which indicates that augmentation with a larger amount of synthetic data is not always better.
APA, Harvard, Vancouver, ISO, and other styles
31

Jiangsha, Ai, Lulu Tian, Libing Bai, and Jie Zhang. "Data augmentation by a CycleGAN-based extra-supervised model for nondestructive testing." Measurement Science and Technology 33, no. 4 (January 31, 2022): 045017. http://dx.doi.org/10.1088/1361-6501/ac3ec3.

Full text
Abstract:
The deep learning method is widely used in computer vision tasks with large-scale annotated datasets. However, obtaining such datasets in most directions of the vision-based nondestructive testing (NDT) field is very challenging. Data augmentation has proved an efficient way of dealing with the lack of large-scale annotated datasets. In this paper, we propose a CycleGAN-based extra-supervised (CycleGAN-ES) model to generate synthetic NDT images, where the ES is used to ensure that the bidirectional mapping is learned for corresponding labels and defects. Furthermore, we show the effectiveness of using the synthesized images to train deep convolutional neural networks (DCNNs) for defect recognition. In the experiments, we extract a number of x-ray welding images, both with and without defects, from the published GDXray dataset, and CycleGAN-ES is used to generate synthetic defect images based on a small number of extracted defect images and manually drawn labels that are used as a content guide. For quality verification of the synthesized defect images, we use a high-performance classifier pretrained on a big dataset to recognize the synthetic defects and show the comparability of the performances of classifiers trained using synthetic defects and real defects, respectively. To demonstrate the effectiveness of using the synthesized defects as an augmentation method, we train and evaluate the performance of DCNNs for defect recognition with and without the synthesized defects.
APA, Harvard, Vancouver, ISO, and other styles
32

Volker, Thom Benjamin, and Gerko Vink. "Anonymiced Shareable Data: Using mice to Create and Analyze Multiply Imputed Synthetic Datasets." Psych 3, no. 4 (November 23, 2021): 703–16. http://dx.doi.org/10.3390/psych3040045.

Full text
Abstract:
Synthetic datasets simultaneously allow for the dissemination of research data while protecting the privacy and confidentiality of respondents. Generating and analyzing synthetic datasets is straightforward, yet, a synthetic data analysis pipeline is seldom adopted by applied researchers. We outline a simple procedure for generating and analyzing synthetic datasets with the multiple imputation software mice (Version 3.13.15) in R. We demonstrate through simulations that the analysis results obtained on synthetic data yield unbiased and valid inferences and lead to synthetic records that cannot be distinguished from the true data records. The ease of use when synthesizing data with mice along with the validity of inferences obtained through this procedure opens up a wealth of possibilities for data dissemination and further research on initially private data.
APA, Harvard, Vancouver, ISO, and other styles
33

Gunna, Sanjana, Rohit Saluja, and Cheerakkuzhi Veluthemana Jawahar. "Improving Scene Text Recognition for Indian Languages with Transfer Learning and Font Diversity." Journal of Imaging 8, no. 4 (March 23, 2022): 86. http://dx.doi.org/10.3390/jimaging8040086.

Full text
Abstract:
Reading Indian scene texts is complex due to the use of regional vocabulary, multiple fonts/scripts, and text size. This work investigates the significant differences between Indian and Latin Scene Text Recognition (STR) systems. Recent STR works rely on synthetic generators that involve diverse fonts to ensure robust reading solutions. We propose using additional non-Unicode fonts alongside the generally employed Unicode fonts to cover font diversity in such synthesizers for Indian languages. We also perform experiments on transfer learning among six different Indian languages. Our transfer learning experiments on synthetic images with common backgrounds provide the interesting insight that Indian scripts can benefit more from each other than from the extensive English datasets. Our evaluations in real settings help us achieve significant improvements over previous methods on four Indian languages from standard datasets like IIIT-ILST, MLT-17, and the new dataset (which we release) containing 440 scene images with 500 Gujarati and 2535 Tamil words. Further enriching the synthetic dataset with non-Unicode fonts and multiple augmentations helps us achieve a remarkable Word Recognition Rate gain of over 33% on the IIIT-ILST Hindi dataset. We also present the results of lexicon-based transcription approaches for all six languages.
APA, Harvard, Vancouver, ISO, and other styles
34

Carrara, Matteo, Marco Beccuti, Fulvio Lazzarato, Federica Cavallo, Francesca Cordero, Susanna Donatelli, and Raffaele A. Calogero. "State-of-the-Art Fusion-Finder Algorithms Sensitivity and Specificity." BioMed Research International 2013 (2013): 1–6. http://dx.doi.org/10.1155/2013/340620.

Full text
Abstract:
Background: Gene fusions arising from chromosomal translocations have been implicated in cancer. RNA-seq has the potential to discover such rearrangements generating functional proteins (chimera/fusion). Recently, many methods for chimera detection have been published. However, the specificity and sensitivity of those tools were not extensively investigated in a comparative way. Results: We tested eight fusion-detection tools (FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse, Bellerophontes, ChimeraScan, and TopHat-fusion) to detect fusion events using synthetic and real datasets encompassing chimeras. The comparison analysis run only on synthetic data could generate misleading results, since we found no counterpart on the real dataset. Furthermore, most tools report a very high number of false positive chimeras. In particular, the most sensitive tool, ChimeraScan, reports a large number of false positives that we were able to significantly reduce by devising and applying two filters to remove fusions not supported by fusion junction-spanning reads or encompassing large intronic regions. Conclusions: The discordant results obtained using synthetic and real datasets suggest that synthetic datasets encompassing fusion events may not fully catch the complexity of RNA-seq experiments. Moreover, fusion detection tools are still limited in sensitivity or specificity; thus, there is space for further improvement in fusion-finder algorithms.
APA, Harvard, Vancouver, ISO, and other styles
35

Mahmood, Ammar, Mohammed Bennamoun, Senjian An, Ferdous Sohel, Farid Boussaid, Renae Hovey, and Gary Kendrick. "Automatic detection of Western rock lobster using synthetic data." ICES Journal of Marine Science 77, no. 4 (November 22, 2019): 1308–17. http://dx.doi.org/10.1093/icesjms/fsz223.

Full text
Abstract:
Underwater imaging is being extensively used for monitoring the abundance of lobster species and their biodiversity in their local habitats. However, manual assessment of these images requires a huge amount of human effort. In this article, we propose to automate the process of lobster detection using a deep learning technique. A major obstacle in deploying such an automatic framework for the localization of lobsters in diverse environments is the lack of large annotated training datasets. Generating synthetic datasets to train these object detection models has become a popular approach. However, the current synthetic data generation frameworks rely on automatic segmentation of objects of interest, which becomes difficult when the objects have a complex shape, such as lobster. To overcome this limitation, we propose an approach to synthetically generate parts of the lobster. To handle the variability of real-world images, these parts were inserted into a set of diverse background marine images to generate a large synthetic dataset. A state-of-the-art object detector was trained using this synthetic parts dataset and tested on the challenging task of Western rock lobster detection in West Australian seas. To the best of our knowledge, this is the first automatic lobster detection technique for partially visible and occluded lobsters.
APA, Harvard, Vancouver, ISO, and other styles
36

Mukherjee, Mimi, and Matloob Khushi. "SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features." Applied System Innovation 4, no. 1 (March 2, 2021): 18. http://dx.doi.org/10.3390/asi4010018.

Full text
Abstract:
Real-world datasets are heavily skewed where some classes are significantly outnumbered by the other classes. In these situations, machine learning algorithms fail to achieve substantial efficacy while predicting these underrepresented instances. To solve this problem, many variations of synthetic minority oversampling methods (SMOTE) have been proposed to balance datasets which deal with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique to balance the data. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE—Encoded Nominal and Continuous), in which nominal features are encoded as numeric values and the difference between two such numeric values reflects the amount of change of association with the minority class. Our experiments show that classification models using the SMOTE-ENC method offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and also when there is some association between the categorical features and the target class. Additionally, our proposed method addressed one of the major limitations of the SMOTE-NC algorithm. SMOTE-NC can be applied only on mixed datasets that have features consisting of both continuous and nominal features and cannot function if all the features of the dataset are nominal. Our novel method has been generalized to be applied to both mixed datasets and nominal-only datasets.
APA, Harvard, Vancouver, ISO, and other styles
37

Weitz, Darío, Denis María, Franco Lianza, Nicole Schmidt, and Juan Pablo Nant. "Smart home simulation model for synthetic sensor datasets generation." Sistemas y Telemática 14, no. 39 (December 1, 2016): 71–84. http://dx.doi.org/10.18046/syt.v14i39.2350.

Full text
Abstract:
World population is ageing due to longer life expectancy worldwide, and there is a trend for elderly people to live alone in their habitual residences in spite of health and safety risks. Smart Homes, intelligent environment systems deployed in elderly homes, can act as early warning systems that try to forecast the worsening or exacerbation of the resident's chronic conditions. Access to sensor datasets is essential for the development of an efficient real smart home, but procurement of such datasets is subject to several restrictions and difficulties. This paper describes the generation of synthetic datasets by means of a simulation model as a suitable alternative prior to the deployment of a real monitoring system. The collection of synthetic datasets will be used during the next project step to train and evaluate activity recognition methods and algorithms.
APA, Harvard, Vancouver, ISO, and other styles
38

Avraam, Demetris, Rebecca C. Wilson, and Paul Burton. "Synthetic ALSPAC longitudinal datasets for the Big Data VR project." Wellcome Open Research 2 (August 30, 2017): 74. http://dx.doi.org/10.12688/wellcomeopenres.12441.1.

Full text
Abstract:
Three synthetic datasets, of 15,000, 155,000 and 1,555,000 participants respectively, were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSPAC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (co-variance matrices, as well as the mean and variance values of the variables) without including the original data itself or disclosing participant information. In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.
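Simulating participants that preserve summary statistics without copying any record can be sketched with a multivariate normal draw. The mean vector and covariance matrix below are assumed toy stand-ins; the real ALSPAC statistics are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the cohort's summary statistics
mean = np.array([120.0, 70.0, 1.65])                  # e.g. SBP, HR, height
cov = np.array([[100.0, 20.0, 0.5],
                [ 20.0, 64.0, 0.2],
                [  0.5,  0.2, 0.01]])

def synthesize(mean, cov, n):
    """Draw synthetic participants that reproduce the mean and covariance
    structure without containing any original record."""
    return rng.multivariate_normal(mean, cov, size=n)

synthetic = synthesize(mean, cov, 15000)
```

The ALSPAC variables are longitudinal across nine collection ages, so the real covariance matrix would span all age-by-variable combinations; the mechanism is the same.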
APA, Harvard, Vancouver, ISO, and other styles
39

Triastcyn, Aleksei, and Boi Faltings. "Generating Higher-Fidelity Synthetic Datasets with Privacy Guarantees." Algorithms 15, no. 7 (July 1, 2022): 232. http://dx.doi.org/10.3390/a15070232.

Full text
Abstract:
We consider the problem of enhancing user privacy in common data analysis and machine learning development tasks, such as data annotation and inspection, by substituting the real data with samples from a generative adversarial network. We propose employing Bayesian differential privacy as the means to achieve a rigorous theoretical guarantee while providing a better privacy-utility trade-off. We demonstrate experimentally that our approach produces higher-fidelity samples compared to prior work, allowing us to (1) detect more subtle data errors and biases, and (2) reduce the need for real data labelling by achieving high accuracy when training directly on artificial samples.
APA, Harvard, Vancouver, ISO, and other styles
40

Dilkina, Bistra, Katherine Lai, Ronan Le Bras, Yexiang Xue, Carla Gomes, Ashish Sabharwal, Jordan Suter, Kevin McKelvey, Michael Schwartz, and Claire Montgomery. "Large Landscape Conservation — Synthetic and Real-World Datasets." Proceedings of the AAAI Conference on Artificial Intelligence 27, no. 1 (June 29, 2013): 1369–72. http://dx.doi.org/10.1609/aaai.v27i1.8489.

Full text
Abstract:
Biodiversity underpins ecosystem goods and services, and hence protecting it is key to achieving sustainability. However, the persistence of many species is threatened by habitat loss and fragmentation due to human land use and climate change. Conservation efforts are implemented under very limited economic resources, and therefore designing scalable, cost-efficient and systematic approaches for conservation planning is an important and challenging computational task. In particular, preserving landscape connectivity between areas of good habitat has become a key conservation priority in recent years. We give an overview of landscape connectivity conservation and some of the underlying graph-theoretic optimization problems. We present a synthetic generator capable of creating families of randomized structured problems, capturing the essential features of real-world instances but allowing for a thorough typical-case performance evaluation of different solution methods. We also present two large-scale real-world datasets, including economic data on land cost, and species data for grizzly bears, wolverines and lynx.
APA, Harvard, Vancouver, ISO, and other styles
41

Larrañeta, M., C. Fernandez-Peruchena, M. A. Silva-Pérez, I. Lillo-bravo, A. Grantham, and J. Boland. "Generation of synthetic solar datasets for risk analysis." Solar Energy 187 (July 2019): 212–25. http://dx.doi.org/10.1016/j.solener.2019.05.042.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Tomás, Jimena Torres, Newton Spolaôr, Everton Alvares Cherman, and Maria Carolina Monard. "A Framework to Generate Synthetic Multi-label Datasets." Electronic Notes in Theoretical Computer Science 302 (February 2014): 155–76. http://dx.doi.org/10.1016/j.entcs.2014.01.025.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Garrow, Laurie A., Tudor D. Bodea, and Misuk Lee. "Generation of synthetic datasets for discrete choice analysis." Transportation 37, no. 2 (October 14, 2009): 183–202. http://dx.doi.org/10.1007/s11116-009-9228-6.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Barbierato, Enrico, Marco L. Della Vedova, Daniele Tessera, Daniele Toti, and Nicola Vanoli. "A Methodology for Controlling Bias and Fairness in Synthetic Data Generation." Applied Sciences 12, no. 9 (May 4, 2022): 4619. http://dx.doi.org/10.3390/app12094619.

Full text
Abstract:
The development of algorithms based on machine learning techniques that support (or even replace) human judgment must take into account concepts such as data bias and fairness. Though the scientific literature proposes numerous techniques to detect and evaluate these problems, less attention has been dedicated to methods that generate intentionally biased datasets, which data scientists could use to develop and validate unbiased and fair decision-making algorithms. To this end, this paper presents a novel method to generate a synthetic dataset in which bias can be modeled by using a probabilistic network exploiting structural equation modeling. The proposed methodology has been validated on a simple dataset to highlight the impact of tuning parameters on bias and fairness, as well as on a more realistic example based on a loan approval status dataset. In particular, this methodology requires a limited number of parameters compared to other techniques for generating datasets with a controlled amount of bias and fairness.
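The idea of a generator with tunable bias can be sketched with a toy structural-equation model. All names, weights and thresholds below are our illustrative assumptions, not the paper's probabilistic network:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_biased(n, bias=1.0):
    """Toy structural-equation generator: a protected attribute A feeds both a
    legitimate feature X and the label Y; `bias` tunes how strongly A leaks
    directly into Y."""
    a = rng.integers(0, 2, n)                      # protected attribute
    x = 0.8 * a + rng.normal(0.0, 1.0, n)          # feature partly driven by A
    y = (x + bias * a + rng.normal(0.0, 0.5, n) > 1.0).astype(int)
    return a, x, y

# The demographic-parity gap grows with `bias`:
a, x, y = generate_biased(10_000, bias=2.0)
gap = y[a == 1].mean() - y[a == 0].mean()
```

A fairness-aware algorithm can then be validated against datasets generated across a range of `bias` values with known ground truth.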
APA, Harvard, Vancouver, ISO, and other styles
45

Baidari, Ishwar, and Channamma Patil. "A Criterion for Deciding the Number of Clusters in a Dataset Based on Data Depth." Vietnam Journal of Computer Science 07, no. 04 (July 8, 2020): 417–31. http://dx.doi.org/10.1142/s2196888820500232.

Full text
Abstract:
Clustering is a key method in unsupervised learning with various applications in data mining, pattern recognition and intelligent information processing. However, the number of groups to be formed, usually denoted k, is a vital parameter for most existing clustering algorithms, as their clustering results depend heavily on it. The problem of finding the optimal k value is very challenging. This paper proposes a novel idea for finding the correct number of groups in a dataset based on data depth. The idea is to avoid the traditional process of running the clustering algorithm over a dataset multiple times and, further, to find the k value for a dataset without setting any specific search range for the k parameter. We experiment with different indices, namely CH, KL, Silhouette, Gap, CSP and the proposed method, on different real and synthetic datasets to estimate the correct number of groups in a dataset. The experimental results on real and synthetic datasets indicate good performance of the proposed method.
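The traditional search that this abstract contrasts with can be sketched as follows: run the clustering algorithm once per candidate k over a fixed search range and keep the k with the best validity index (here the silhouette score); the depth-based method proposed in the paper avoids exactly this loop.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_max=10):
    """Traditional approach: rerun clustering for each candidate k in a fixed
    search range and keep the k with the highest silhouette score."""
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)
```

The cost is one full clustering run per candidate k, which is what makes a single-pass criterion attractive.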
APA, Harvard, Vancouver, ISO, and other styles
46

Mizginov, V. A., and S. Y. Danilov. "SYNTHETIC THERMAL BACKGROUND AND OBJECT TEXTURE GENERATION USING GEOMETRIC INFORMATION AND GAN." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLII-2/W12 (May 9, 2019): 149–54. http://dx.doi.org/10.5194/isprs-archives-xlii-2-w12-149-2019.

Full text
Abstract:
Nowadays methods based on deep neural networks show the best performance among image recognition and object detection algorithms. Nevertheless, such methods require large databases of multispectral images of various objects to achieve state-of-the-art results, so dataset generation is one of the major challenges for the successful training of a deep neural network. Infrared image datasets that are large enough for successful training of a deep neural network are not available in the public domain, and generating synthetic datasets from 3D models of various scenes is a time-consuming method that requires long computation time and is not very realistic. This paper is focused on the development of a method for thermal image synthesis using a GAN (generative adversarial network). The aim of the presented work is to expand and complement the existing datasets of real thermal images. Today, deep convolutional networks are increasingly used for synthesizing various images, and a new generation of such algorithms, commonly called GANs, has become a promising tool for synthesizing images of various spectral ranges; these networks show effective results for image-to-image translation. While it is possible to generate a thermal texture for a single object, generating environment textures is extremely difficult due to the presence of a large number of objects with different emission sources. The proposed method is based on a joint approach that uses 3D modeling and deep learning: background and object textures are synthesized using a generative-adversarial neural network together with semantic and geometric information about objects produced by 3D modeling. The developed approach significantly improves the realism of the synthetic images, especially in terms of the quality of background textures.
APA, Harvard, Vancouver, ISO, and other styles
47

Hopwood, Michael W., Joshua S. Stein, Jennifer L. Braid, and Hubert P. Seigneur. "Physics-Based Method for Generating Fully Synthetic IV Curve Training Datasets for Machine Learning Classification of PV Failures." Energies 15, no. 14 (July 12, 2022): 5085. http://dx.doi.org/10.3390/en15145085.

Full text
Abstract:
Classification machine learning models require high-quality labeled datasets for training. Among the most useful datasets for photovoltaic array fault detection and diagnosis are module or string current-voltage (IV) curves. Unfortunately, such datasets are rarely collected due to the cost of high fidelity monitoring, and the data that is available is generally not ideal, often consisting of unbalanced classes, noisy data due to environmental conditions, and few samples. In this paper, we propose an alternate approach that utilizes physics-based simulations of string-level IV curves as a fully synthetic training corpus that is independent of the test dataset. In our example, the training corpus consists of baseline (no fault), partial soiling, and cell crack system modes. The training corpus is used to train a 1D convolutional neural network (CNN) for failure classification. The approach is validated by comparing the model’s ability to classify failures detected on a real, measured IV curve testing corpus obtained from laboratory and field experiments. Results obtained using a fully synthetic training dataset achieve identical accuracy to those obtained with use of a measured training dataset. When evaluating the measured data’s test split, a 100% accuracy was found both when using simulations or measured data as the training corpus. When evaluating all of the measured data, a 96% accuracy was found when using a fully synthetic training dataset. The use of physics-based modeling results as a training corpus for failure detection and classification has many advantages for implementation as each PV system is configured differently, and it would be nearly impossible to train using labeled measured data.
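The physics-based corpus generation can be illustrated with an ideal single-diode string model (all parameter values are illustrative, and series/shunt resistances are ignored; the paper's simulations are far more detailed). A partial-soiling fault is mimicked by scaling the photocurrent:

```python
import numpy as np

def iv_curve(v, i_ph=9.0, i_0=1e-9, n=1.3, v_t=1.5, soiling=0.0):
    """Ideal single-diode model of a 60-cell string (v_t ~ 0.025 V per cell
    times 60 cells); `soiling` scales the photocurrent to mimic a
    partial-soiling fault. Series and shunt resistance are ignored."""
    return (1.0 - soiling) * i_ph - i_0 * np.expm1(v / (n * v_t))

v = np.linspace(0.0, 40.0, 200)       # string voltage sweep [V]
baseline = iv_curve(v)                # "no fault" class
soiled = iv_curve(v, soiling=0.3)     # "partial soiling" class
```

Sweeping the fault parameters (and adding measurement noise) yields labeled IV curves on which a 1D CNN classifier can then be trained.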
APA, Harvard, Vancouver, ISO, and other styles
48

Danesh, Hajar, Keivan Maghooli, Alireza Dehghani, and Rahele Kafieh. "Synthetic OCT data in challenging conditions: three-dimensional OCT and presence of abnormalities." Medical & Biological Engineering & Computing 60, no. 1 (November 18, 2021): 189–203. http://dx.doi.org/10.1007/s11517-021-02469-w.

Full text
Abstract:
Nowadays, retinal optical coherence tomography (OCT) plays an important role in ophthalmology, and automatic analysis of OCT is of real importance: image denoising facilitates a better diagnosis, and image segmentation and classification are undeniably critical in treatment evaluation. Synthetic OCT was recently considered to provide a benchmark for quantitative comparison of automatic algorithms and to be utilized in the training stage of novel solutions based on deep learning. Due to the complicated data structure of retinal OCTs, only a limited number of delineated OCT datasets are available in the presence of abnormalities; furthermore, the intrinsic three-dimensional (3D) structure of OCT is ignored in many public 2D datasets. We propose a new synthesis method, applicable to 3D data and feasible in the presence of abnormalities like diabetic macular edema (DME). In this method, a limited number of OCT data is used during the training step, and the Active Shape Model is used to produce synthetic OCTs together with delineations of retinal boundaries and locations of abnormalities. Statistical comparison of thickness maps showed that the synthetic dataset can be used as a statistically acceptable representative of the original dataset (p > 0.05). Visual inspection of the synthesized vessels was also promising. Regarding the texture features of the synthesized datasets, Q-Q plots were used; even in cases where the points slightly digressed from the straight line, the p-values of the Kolmogorov-Smirnov test did not reject the null hypothesis, indicating the same distribution of texture features in the real and the synthetic data. The proposed algorithm provides a unique benchmark for the comparison of OCT enhancement methods and a tailored augmentation method to overcome the limited number of OCTs available for deep learning algorithms.
APA, Harvard, Vancouver, ISO, and other styles
49

Danilov, V. V., O. M. Gerget, D. Y. Kolpashchikov, N. V. Laptev, R. A. Manakov, L. A. Hérnandez-Gómez, F. Alvarez, and M. J. Ledesma-Carbayo. "BOOSTING SEGMENTATION ACCURACY OF THE DEEP LEARNING MODELS BASED ON THE SYNTHETIC DATA GENERATION." International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLIV-2/W1-2021 (April 15, 2021): 33–40. http://dx.doi.org/10.5194/isprs-archives-xliv-2-w1-2021-33-2021.

Full text
Abstract:
In the era of data-driven machine learning algorithms, data represents the new oil. Machine learning algorithms need large, heterogeneous datasets that are, crucially, correctly labeled. However, data collection and labeling are time-consuming and labor-intensive processes. A particular task we solve using machine learning is the segmentation of medical devices in echocardiographic images during minimally invasive surgery. The lack of data motivated us to develop an algorithm that generates synthetic samples based on real datasets. The concept of this algorithm is to place a medical device (catheter) in an empty cavity of an anatomical structure, for example a heart chamber, and then transform it. To create random transformations of the catheter, the algorithm uses a coordinate system that uniquely identifies each point regardless of the bend and shape of the object. A cylindrical coordinate system is taken as the basis and modified by replacing the Z-axis with a spline along which the h-coordinate is measured. Using the proposed algorithm, we generated new images with the catheter inserted into different heart cavities while varying its location and shape. Afterward, we compared deep neural networks trained on datasets comprising real and synthetic data. The network trained on both real and synthetic data performed more accurate segmentation than the model trained only on real data: a modified U-net trained on the combined datasets achieved a Dice similarity coefficient of 92.6±2.2%, while the same model trained only on real samples achieved 86.5±3.6%. Using a synthetic dataset decreased the accuracy spread and improved the generalization of the model. It is also worth noting that the proposed algorithm reduces subjectivity, minimizes the labeling routine, increases the number of samples, and improves their heterogeneity.
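The modified cylindrical coordinate system the authors describe (a spline replacing the Z-axis, with the h-coordinate measured along it) can be sketched for a polyline axis as follows; the helper names and the frame construction are ours, not the paper's:

```python
import numpy as np

def frame_at(spline_pts, h):
    """Return the point and unit tangent at arclength h along a polyline axis."""
    seg = np.diff(spline_pts, axis=0)
    lens = np.linalg.norm(seg, axis=1)
    s = np.concatenate([[0.0], np.cumsum(lens)])       # cumulative arclength
    h = np.clip(h, 0.0, s[-1])
    i = min(np.searchsorted(s, h, side="right") - 1, len(seg) - 1)
    t = seg[i] / lens[i]
    return spline_pts[i] + (h - s[i]) * t, t

def cylindrical_to_cartesian(spline_pts, h, r, phi):
    """Map the modified cylindrical coordinates (h along the spline axis,
    radius r, angle phi) to a 3D point."""
    p, t = frame_at(spline_pts, h)
    a = np.array([0.0, 0.0, 1.0])
    if abs(np.dot(a, t)) > 0.9:                        # avoid a degenerate normal
        a = np.array([0.0, 1.0, 0.0])
    n1 = np.cross(t, a); n1 /= np.linalg.norm(n1)
    n2 = np.cross(t, n1)
    return p + r * (np.cos(phi) * n1 + np.sin(phi) * n2)
```

Bending the spline and perturbing (h, r, phi) then produces randomly transformed catheters that still lie inside the heart cavity, while every surface point stays uniquely addressable.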
APA, Harvard, Vancouver, ISO, and other styles
50

Ferriyan, Andrey, Achmad Husni Thamrin, Keiji Takeda, and Jun Murai. "Generating Network Intrusion Detection Dataset Based on Real and Encrypted Synthetic Attack Traffic." Applied Sciences 11, no. 17 (August 26, 2021): 7868. http://dx.doi.org/10.3390/app11177868.

Full text
Abstract:
The lack of publicly available up-to-date datasets contributes to the difficulty in evaluating intrusion detection systems. This paper introduces HIKARI-2021, a dataset that contains encrypted synthetic attacks and benign traffic. This dataset conforms to two requirements: the content requirements, which focus on the produced dataset, and the process requirements, which focus on how the dataset is built. We compile these requirements to enable future dataset developments and we make the HIKARI-2021 dataset, along with the procedures to build it, available for the public.
APA, Harvard, Vancouver, ISO, and other styles
