To see the other types of publications on this topic, follow the link: Datasety.

Journal articles on the topic 'Datasety'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 journal articles for your research on the topic 'Datasety.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Almeida, Daniela, Dany Domínguez-Pérez, Ana Matos, Guillermin Agüero-Chapin, Yuselis Castaño, Vitor Vasconcelos, Alexandre Campos, and Agostinho Antunes. "Data Employed in the Construction of a Composite Protein Database for Proteogenomic Analyses of Cephalopods Salivary Apparatus." Data 5, no. 4 (November 27, 2020): 110. http://dx.doi.org/10.3390/data5040110.

Full text
Abstract:
Here we provide all datasets and details applied in the construction of a composite protein database required for the proteogenomic analyses of the article “Putative Antimicrobial Peptides of the Posterior Salivary Glands from the Cephalopod Octopus vulgaris Revealed by Exploring a Composite Protein Database”. All data, subdivided into six datasets, are deposited at the Mendeley Data repository as follows. Dataset_1 provides our composite database “All_Databases_5950827_sequences.fasta” derived from six smaller databases composed of (i) protein sequences retrieved from public databases related to cephalopods’ salivary glands, (ii) proteins identified with Proteome Discoverer software using our original data obtained by shotgun proteomic analyses of posterior salivary glands (PSGs) from three Octopus vulgaris specimens (provided as Dataset_2) and (iii) a non-redundant antimicrobial peptide (AMP) database. Dataset_3 includes the transcripts obtained by de novo assembly of 16 transcriptomes from cephalopods’ PSGs using CLC Genomics Workbench. Dataset_4 provides the proteins predicted by the TransDecoder tool from the de novo assembly of 16 transcriptomes of cephalopods’ PSGs. Further details about database construction, as well as the scripts and command lines used to construct them, are deposited within Dataset_5 and Dataset_6. The data provided in this article will assist in unravelling the role of cephalopods’ PSGs in feeding strategies, toxins and AMP production.
APA, Harvard, Vancouver, ISO, and other styles
2

Haider, S. A., and N. S. Patil. "Minimization of Datasets: Using a Master Interlinked Dataset." Indian Journal of Computer Science 3, no. 5 (October 1, 2018): 20. http://dx.doi.org/10.17010/ijcs/2018/v3/i5/138778.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Feng, Eric, and Xijin Ge. "DataViz: visualization of high-dimensional data in virtual reality." F1000Research 7 (October 23, 2018): 1687. http://dx.doi.org/10.12688/f1000research.16453.1.

Full text
Abstract:
Virtual reality (VR) simulations promote interactivity and immersion, and provide an opportunity that may help researchers gain insights from complex datasets. To explore the utility and potential of VR in graphically rendering large datasets, we have developed an application for immersive, 3-dimensional (3D) scatter plots. Developed using the Unity development environment, DataViz enables the visualization of high-dimensional data with the HTC Vive, a relatively inexpensive and modern virtual reality headset available to the general public. DataViz has the following features: (1) principal component analysis (PCA) of the dataset; (2) graphical rendering of said dataset’s 3D projection onto its first three principal components; and (3) intuitive controls and instructions for using the application. As a use case, we applied DataViz to visualize a single-cell RNA-Seq dataset. DataViz can help gain insights from complex datasets by enabling interaction with high-dimensional data.
APA, Harvard, Vancouver, ISO, and other styles
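The PCA step described in this abstract, projecting a high-dimensional dataset onto its leading principal components for display, can be sketched in plain Python. This is a minimal illustration, not DataViz code; the helper names and the power-iteration approach are assumptions, and a real application would use an optimized linear-algebra library.

```python
import random

def pca_top_components(X, k, iters=200, seed=0):
    """Top-k principal components of row-major data X, via power iteration
    with deflation on the sample covariance matrix."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mean[j] for j in range(d)] for row in X]
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]
    comps = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]  # random start, away from zero
        for _ in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
        comps.append(v)
        # deflate so the next iteration finds the next component
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return mean, comps

def project(X, mean, comps):
    """Project each centred row of X onto the extracted components."""
    return [[sum((row[j] - mean[j]) * c[j] for j in range(len(row))) for c in comps]
            for row in X]
```

With k = 3, the projected rows are exactly the 3D coordinates a scatter-plot renderer such as the one described above would draw.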
4

Chang, Nai Chen, Elissa Aminoff, John Pyles, Michael Tarr, and Abhinav Gupta. "Scaling Up Neural Datasets: A public fMRI dataset of 5000 scenes." Journal of Vision 18, no. 10 (September 1, 2018): 732. http://dx.doi.org/10.1167/18.10.732.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Zhang, Yulian, and Shigeyuki Hamori. "Forecasting Crude Oil Market Crashes Using Machine Learning Technologies." Energies 13, no. 10 (May 13, 2020): 2440. http://dx.doi.org/10.3390/en13102440.

Full text
Abstract:
To the best of our knowledge, this study provides new insight into the forecasting of crude oil futures price crashes in America by employing two moving-window schemes: a fixed-length window and an expanding-length window, the latter of which has never been reported in the past. We aimed to investigate whether there is any difference when historical data are discarded. As the explanatory variables, we adopted 13 variables to obtain two datasets: 16 explanatory variables for Dataset1 and 121 explanatory variables for Dataset2. We observe results from these different-sized sets of explanatory variables. Specifically, we leverage the merits of a series of machine learning techniques, which include random forests, logistic regression, support vector machines, and extreme gradient boosting (XGBoost). Finally, we employ the evaluation metrics that are broadly used to assess the discriminatory power of imbalanced datasets. Our results indicate that we should occasionally discard distant historical data, and that XGBoost outperforms the other employed approaches, achieving a detection rate as high as 86% using the fixed-length moving window for Dataset2.
APA, Harvard, Vancouver, ISO, and other styles
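The two moving-window schemes contrasted in this abstract differ only in whether the window's start advances. A minimal sketch (the function name and signature are illustrative, not code from the paper):

```python
def moving_windows(n_obs, train_len, mode="fixed"):
    """One-step-ahead forecasting splits over n_obs observations.
    mode="fixed": a constant-length window slides forward, discarding old data.
    mode="expanding": the window start stays at 0, so all history is kept."""
    splits = []
    for test_i in range(train_len, n_obs):
        start = test_i - train_len if mode == "fixed" else 0
        splits.append((list(range(start, test_i)), test_i))
    return splits
```

For example, with five observations and a training length of three, the fixed scheme trains on indices [1, 2, 3] for the final test point, while the expanding scheme trains on [0, 1, 2, 3].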
6

Wang, Juan, Zhibin Zhang, and Yanjuan Li. "Constructing Phylogenetic Networks Based on the Isomorphism of Datasets." BioMed Research International 2016 (2016): 1–7. http://dx.doi.org/10.1155/2016/4236858.

Full text
Abstract:
Constructing rooted phylogenetic networks from rooted phylogenetic trees has become an important problem in molecular evolution. So far, many methods have been presented in this area, of which the most efficient are based on the incompatible graph, such as the CASS, the LNETWORK, and the BIMLR. This paper studies the commonness of the methods based on the incompatible graph, the relationship between the incompatible graph and the phylogenetic network, and the topologies of incompatible graphs. We can find all the simplest datasets for a topology G and construct a network for every dataset. For any dataset C, we can compute a network from the network representing the simplest dataset which is isomorphic to C. This process saves time for the algorithms when constructing networks.
APA, Harvard, Vancouver, ISO, and other styles
7

Xie, Yanqing, Zhengqiang Li, Weizhen Hou, Jie Guang, Yan Ma, Yuyang Wang, Siheng Wang, and Dong Yang. "Validation of FY-3D MERSI-2 Precipitable Water Vapor (PWV) Datasets Using Ground-Based PWV Data from AERONET." Remote Sensing 13, no. 16 (August 16, 2021): 3246. http://dx.doi.org/10.3390/rs13163246.

Full text
Abstract:
The medium resolution spectral imager-2 (MERSI-2) is one of the most important sensors onboard China’s latest polar-orbiting meteorological satellite, Fengyun-3D (FY-3D). The National Satellite Meteorological Center of China Meteorological Administration has developed four precipitable water vapor (PWV) datasets using five near-infrared bands of MERSI-2, including the P905 dataset, the P936 dataset, the P940 dataset and the fusion dataset of the above three. For the convenience of users, we comprehensively evaluate the quality of these PWV datasets against ground-based PWV data derived from the Aerosol Robotic Network (AERONET). The validation results show that the P905, P936 and fused PWV datasets have relatively large systematic errors (−0.10, −0.11 and −0.07 g/cm2), whereas the systematic error of the P940 dataset (−0.02 g/cm2) is very small. According to the overall accuracy in our assessments, the four PWV datasets can be ranked in descending order as the P940 dataset, the fused dataset, the P936 dataset and the P905 dataset. The root mean square error (RMSE), relative error (RE) and percentage of retrieval results with error within ±(0.05 + 0.10 × PWV_AERONET) (PER10) of the P940 PWV dataset are 0.24 g/cm2, 0.10 and 76.36%, respectively. The RMSE, RE and PER10 of the P905 PWV dataset are 0.38 g/cm2, 0.15 and 57.72%, respectively. To obtain a clearer understanding of the accuracy of these four MERSI-2 PWV datasets, we compare it with that of the widely used MODIS and AIRS PWV datasets. The comparison shows that the accuracy of the MODIS PWV dataset is not as good as that of any of the four MERSI-2 PWV datasets, due to its serious overestimation (0.40 g/cm2), and that the accuracy of the AIRS PWV dataset is worse than that of the P940 and fused MERSI-2 PWV datasets. In addition, we analyze the error distribution of the four PWV datasets across locations, seasons and water vapor content.
Finally, we discuss why the fused PWV dataset is not the one with the highest accuracy among the four PWV datasets.
APA, Harvard, Vancouver, ISO, and other styles
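The three validation metrics quoted in this abstract can be computed as follows. `validation_metrics` is an illustrative helper under the abstract's definition of PER10 (error within ±(0.05 + 0.10 × reference)), not code from the paper:

```python
def validation_metrics(retrieved, reference):
    """RMSE, mean relative error, and PER10: the percentage of retrievals
    whose absolute error falls within +/-(0.05 + 0.10 * reference)."""
    n = len(retrieved)
    errors = [r - t for r, t in zip(retrieved, reference)]
    rmse = (sum(e * e for e in errors) / n) ** 0.5
    rel_err = sum(abs(e) / t for e, t in zip(errors, reference)) / n
    per10 = 100.0 * sum(1 for e, t in zip(errors, reference)
                        if abs(e) <= 0.05 + 0.10 * t) / n
    return rmse, rel_err, per10
```

Note that the PER10 tolerance widens with the reference PWV, so a 0.5 g/cm2 error counts as acceptable at high water-vapor loads but not at low ones.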
8

Bahrami, Mostafa, Hossein Javadikia, and Ebrahim Ebrahimi. "APPLICATION OF PATTERN RECOGNITION TECHNIQUES FOR FAULT DETECTION OF CLUTCH RETAINER OF TRACTOR." Journal of Mechanical Engineering 47, no. 1 (May 1, 2018): 31–36. http://dx.doi.org/10.3329/jme.v47i1.35356.

Full text
Abstract:
This study develops a pattern recognition technique for fault diagnosis of the clutch retainer mechanism of the MF285 tractor using a neural network. In this technique, time-domain features and frequency-domain features, consisting of the Fast Fourier Transform (FFT) phase angle and Power Spectral Density (PSD), are proposed to improve diagnostic ability. Three different conditions (normal, bearing wear and shaft wear) were considered for signal processing. The data are divided into two parts: 70% forms dataset1 and 30% forms dataset2. At first, the artificial neural networks (ANN) are trained on 60% of dataset1, validated on 20% of dataset1 and tested on the remaining 20% of dataset1. Then, to test the proposed model further, the network is simulated using dataset2. The results indicate an effective ability to accurately diagnose various faults of the MF285 tractor clutch retainer mechanism using pattern recognition networks.
APA, Harvard, Vancouver, ISO, and other styles
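The nested splits described in this abstract (70/30 into dataset1 and dataset2, then 60/20/20 of dataset1 for training, validation and testing) can be sketched as follows. `partition` is an assumed helper, and real work would shuffle the samples before splitting:

```python
def partition(samples, fractions):
    """Split a list sequentially into parts whose sizes follow the given fractions."""
    parts, start = [], 0
    for f in fractions[:-1]:
        size = round(len(samples) * f)
        parts.append(samples[start:start + size])
        start += size
    parts.append(samples[start:])  # remainder absorbs rounding
    return parts

# 70/30 split into dataset1 and dataset2, then 60/20/20 of dataset1
data = list(range(100))
dataset1, dataset2 = partition(data, [0.7, 0.3])
train, val, test = partition(dataset1, [0.6, 0.2, 0.2])
```

Keeping dataset2 entirely outside the train/validate/test cycle, as the study does, gives a second, untouched check on generalization.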
9

Bogaardt, Laurens, Romulo Goncalves, Raul Zurita-Milla, and Emma Izquierdo-Verdiguier. "Dataset Reduction Techniques to Speed Up SVD Analyses on Big Geo-Datasets." ISPRS International Journal of Geo-Information 8, no. 2 (January 26, 2019): 55. http://dx.doi.org/10.3390/ijgi8020055.

Full text
Abstract:
The Singular Value Decomposition (SVD) is a mathematical procedure with multiple applications in the geosciences. For instance, it is used in dimensionality reduction and as a support operator for various analytical tasks applicable to spatio-temporal data. Performing SVD analyses on large datasets, however, can be computationally costly, time consuming, and sometimes practically infeasible. However, techniques exist to arrive at the same output, or at a close approximation, which requires far less effort. This article examines several such techniques in relation to the inherent scale of the structure within the data. When the values of a dataset vary slowly, e.g., in a spatial field of temperature over a country, there is autocorrelation and the field contains large scale structure. Datasets do not need a high resolution to describe such fields and their analysis can benefit from alternative SVD techniques based on rank deficiency, coarsening, or matrix factorization approaches. We use both simulated Gaussian Random Fields with various levels of autocorrelation and real-world geospatial datasets to illustrate our study while examining the accuracy of various SVD techniques. As the main result, this article provides researchers with a decision tree indicating which technique to use when and predicting the resulting level of accuracy based on the dataset’s structure scale.
APA, Harvard, Vancouver, ISO, and other styles
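A minimal illustration of the low-rank idea behind these SVD-reduction techniques: when a field has large-scale structure, a rank-1 factorization already captures much of it. `rank1_approx` is an assumed helper using power iteration on AᵀA; the paper's actual techniques (rank deficiency, coarsening, matrix factorization) are more sophisticated.

```python
import random

def rank1_approx(A, iters=500, seed=0):
    """Best rank-1 approximation sigma * u v^T of a matrix A (list of rows),
    found by power iteration on A^T A."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    v = [rng.random() + 0.1 for _ in range(n)]  # random start, away from zero
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]  # A^T A v
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = sum(x * x for x in Av) ** 0.5  # leading singular value
    u = [x / sigma for x in Av]
    return [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]
```

On a strongly autocorrelated field the residual after subtracting this rank-1 term is small, which is exactly why truncated and randomized SVD variants can replace the full decomposition at far lower cost.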
10

Yu, Ellen, Aparna Bhaskaran, Shang-Lin Chen, Zachary E. Ross, Egill Hauksson, and Robert W. Clayton. "Southern California Earthquake Data Now Available in the AWS Cloud." Seismological Research Letters 92, no. 5 (June 16, 2021): 3238–47. http://dx.doi.org/10.1785/0220210039.

Full text
Abstract:
The Southern California Earthquake Data Center is hosting its earthquake catalog and seismic waveform archive in the Amazon Web Services (AWS) Open Dataset Program (s3://scedc-pds; us-west-2 region). The cloud dataset’s high data availability and scalability facilitate research that uses large volumes of data and computationally intensive processing. We describe the data archive and our rationale for the formats and data organization. We provide two simple examples to show how storing the data in AWS Simple Storage Service can benefit the analysis of large datasets. We share usage statistics of our data during the first year in the AWS Open Dataset Program. We also discuss the challenges and opportunities of a cloud-hosted archive.
APA, Harvard, Vancouver, ISO, and other styles
11

Waliser, Duane, Peter J. Gleckler, Robert Ferraro, Karl E. Taylor, Sasha Ames, James Biard, Michael G. Bosilovich, et al. "Observations for Model Intercomparison Project (Obs4MIPs): status for CMIP6." Geoscientific Model Development 13, no. 7 (July 7, 2020): 2945–58. http://dx.doi.org/10.5194/gmd-13-2945-2020.

Full text
Abstract:
The Observations for Model Intercomparison Project (Obs4MIPs) was initiated in 2010 to facilitate the use of observations in climate model evaluation and research, with a particular target being the Coupled Model Intercomparison Project (CMIP), a major initiative of the World Climate Research Programme (WCRP). To this end, Obs4MIPs (1) targets observed variables that can be compared to CMIP model variables; (2) utilizes dataset formatting specifications and metadata requirements closely aligned with CMIP model output; (3) provides brief technical documentation for each dataset, designed for nonexperts and tailored towards relevance for model evaluation, including information on uncertainty, dataset merits, and limitations; and (4) disseminates the data through the Earth System Grid Federation (ESGF) platforms, making the observations searchable and accessible via the same portals as the model output. Taken together, these characteristics of the organization and structure of obs4MIPs should entice a more diverse community of researchers to engage in the comparison of model output with observations and to contribute to a more comprehensive evaluation of the climate models. At present, the number of obs4MIPs datasets has grown to about 80; many are undergoing updates, with another 20 or so in preparation, and more than 100 are proposed and under consideration. A partial list of current global satellite-based datasets includes humidity and temperature profiles; a wide range of cloud and aerosol observations; ocean surface wind, temperature, height, and sea ice fraction; surface and top-of-atmosphere longwave and shortwave radiation; and ozone (O3), methane (CH4), and carbon dioxide (CO2) products.
A partial list of proposed products expected to be useful in analyzing CMIP6 results includes the following: alternative products for the above quantities, additional products for ocean surface flux and chlorophyll products, a number of vegetation products (e.g., FAPAR, LAI, burned area fraction), ice sheet mass and height, carbon monoxide (CO), and nitrogen dioxide (NO2). While most existing obs4MIPs datasets consist of monthly-mean gridded data over the global domain, products with higher time resolution (e.g., daily) and/or regional products are now receiving more attention. Along with an increasing number of datasets, obs4MIPs has implemented a number of capability upgrades including (1) an updated obs4MIPs data specifications document that provides additional search facets and generally improves congruence with CMIP6 specifications for model datasets, (2) a set of six easily understood indicators that help guide users as to a dataset's maturity and suitability for application, and (3) an option to supply supplemental information about a dataset beyond what can be found in the standard metadata. With the maturation of the obs4MIPs framework, the dataset inclusion process, and the dataset formatting guidelines and resources, the scope of the observations being considered is expected to grow to include gridded in situ datasets as well as datasets with a regional focus, and the ultimate intent is to judiciously expand this scope to any observation dataset that has applicability for evaluation of the types of Earth system models used in CMIP.
APA, Harvard, Vancouver, ISO, and other styles
12

Kusetogullari, Huseyin, Amir Yavariabdi, Abbas Cheddad, Håkan Grahn, and Johan Hall. "ARDIS: a Swedish historical handwritten digit dataset." Neural Computing and Applications 32, no. 21 (March 29, 2019): 16505–18. http://dx.doi.org/10.1007/s00521-019-04163-3.

Full text
Abstract:
This paper introduces a new image-based handwritten historical digit dataset named Arkiv Digital Sweden (ARDIS). The images in the ARDIS dataset are extracted from 15,000 Swedish church records which were written by different priests with various handwriting styles in the nineteenth and twentieth centuries. The constructed dataset consists of three single-digit datasets and one digit-string dataset. The digit-string dataset includes 10,000 samples in red–green–blue color space, whereas the other datasets contain 7600 single-digit images in different color spaces. An extensive analysis of machine learning methods on several digit datasets is carried out. Additionally, the correlation between ARDIS and the existing digit datasets Modified National Institute of Standards and Technology (MNIST) and US Postal Service (USPS) is investigated. Experimental results show that machine learning algorithms, including deep learning methods, provide low recognition accuracy as they face difficulties when trained on existing datasets and tested on ARDIS. Accordingly, convolutional neural networks trained on MNIST and USPS and tested on ARDIS provide the highest accuracies, 58.80% and 35.44%, respectively. Consequently, the results reveal that machine learning methods trained on existing datasets can have difficulties recognizing digits effectively on our dataset, which proves that the ARDIS dataset has unique characteristics. This dataset is publicly available for the research community to further advance handwritten digit recognition algorithms.
APA, Harvard, Vancouver, ISO, and other styles
13

Sawangarreerak, Siriporn, and Putthiporn Thanathamathee. "Random Forest with Sampling Techniques for Handling Imbalanced Prediction of University Student Depression." Information 11, no. 11 (November 5, 2020): 519. http://dx.doi.org/10.3390/info11110519.

Full text
Abstract:
In this work, we propose a combined sampling technique to improve the performance of imbalanced classification of university student depression data. In our experiments, we found that combining random oversampling with the Tomek links undersampling method allowed us to generate a relatively balanced depression dataset without losing significant information. In this case, the random oversampling technique was used to sample the minority class and balance the number of samples between the classes. Then, the Tomek links technique was used to undersample the data by removing depression records considered less relevant and noisy. The relatively balanced dataset was classified by a random forest. The results show that the overall accuracy in the prediction of adolescent depression data was 94.17%, outperforming the individual sampling techniques. Moreover, our proposed method was tested on another dataset for external validity; its predictive accuracy was found to be 93.33%.
APA, Harvard, Vancouver, ISO, and other styles
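The two sampling steps combined in this abstract can be sketched in pure Python on scalar features. The study used library implementations (e.g., imbalanced-learn); these helper names and the one-dimensional distance are assumptions for illustration:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    matches the majority class in size."""
    rng = random.Random(seed)
    counts = {c: y.count(c) for c in set(y)}
    majority = max(counts, key=counts.get)
    X2, y2 = list(X), list(y)
    for c, cnt in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == c]
        for _ in range(counts[majority] - cnt):
            i = rng.choice(idx)
            X2.append(X[i])
            y2.append(c)
    return X2, y2

def tomek_links(X, y):
    """Tomek links on scalar features: pairs of mutual nearest neighbours
    that carry opposite labels (candidates for removal as noisy/borderline)."""
    def nn(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: abs(X[i] - X[j]))
    links = set()
    for i in range(len(X)):
        j = nn(i)
        if nn(j) == i and y[i] != y[j]:
            links.add((min(i, j), max(i, j)))
    return links
```

In the combined scheme, oversampling balances the class counts first, then each Tomek link flags a borderline pair whose majority-class member can be dropped.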
14

Eum, Hyung-Il, and Anil Gupta. "Hybrid climate datasets from a climate data evaluation system and their impacts on hydrologic simulations for the Athabasca River basin in Canada." Hydrology and Earth System Sciences 23, no. 12 (December 19, 2019): 5151–73. http://dx.doi.org/10.5194/hess-23-5151-2019.

Full text
Abstract:
A reliable climate dataset is the backbone for modelling the essential processes of the water cycle and predicting future conditions. Although a number of gridded climate datasets are available for the North American continent which provide reasonable estimates of climatic conditions in the region, there are inherent inconsistencies in these available climate datasets (e.g., spatially and temporally varying data accuracies, meteorological parameters, lengths of records, spatial coverage, temporal resolution, etc.). These inconsistencies raise questions as to which datasets are the most suitable for the study area and how to systematically combine these datasets to produce a reliable climate dataset for climate studies and hydrological modelling. This study suggests a framework called the REFerence Reliability Evaluation System (REFRES) that systematically ranks multiple climate datasets to generate a hybrid climate dataset for a region. To demonstrate the usefulness of the proposed framework, REFRES was applied to produce a historical hybrid climate dataset for the Athabasca River basin (ARB) in Alberta, Canada. A proxy validation was also conducted to prove the applicability of the generated hybrid climate datasets to hydrologic simulations. This study evaluated five climate datasets, including the station-based gridded climate datasets ANUSPLIN (Australia National University Spline), Alberta Township, and the Pacific Climate Impacts Consortium's (PCIC) PNWNAmet (PCIC NorthWest North America meteorological dataset), a multi-source gridded dataset (Canadian Precipitation Analysis; CaPA), and a reanalysis-based dataset (North American Regional Reanalysis; NARR). The results showed that the gridded climate interpolated from station data performed better than multi-source- and reanalysis-based climate datasets. For the Athabasca River basin, Township and ANUSPLIN were ranked first for precipitation and temperature, respectively.
The proxy validation also confirmed the utility of hybrid climate datasets in hydrologic simulations compared with the other five individual climate datasets investigated in this study. These results indicate that the hybrid climate dataset provides the best representation of historical climatic conditions and, thus, enhances the reliability of hydrologic simulations.
APA, Harvard, Vancouver, ISO, and other styles
15

Majidifard, Hamed, Peng Jin, Yaw Adu-Gyamfi, and William G. Buttlar. "Pavement Image Datasets: A New Benchmark Dataset to Classify and Densify Pavement Distresses." Transportation Research Record: Journal of the Transportation Research Board 2674, no. 2 (February 2020): 328–39. http://dx.doi.org/10.1177/0361198120907283.

Full text
Abstract:
Automated pavement distresses detection using road images remains a challenging topic in the computer vision research community. Recent developments in deep learning have led to considerable research activity directed towards improving the efficacy of automated pavement distress identification and rating. Deep learning models require a large ground truth data set, which is often not readily available in the case of pavements. In this study, a labeled dataset approach is introduced as a first step towards a more robust, easy-to-deploy pavement condition assessment system. The technique is termed herein as the pavement image dataset (PID) method. The dataset consists of images captured from two camera views of an identical pavement segment, that is, a wide view and a top-down view. The wide-view images were used to classify the distresses and to train the deep learning frameworks, while the top-down-view images allowed calculation of distress density, which will be used in future studies aimed at automated pavement rating. For the wide view group dataset, 7,237 images were manually annotated and distresses classified into nine categories. Images were extracted using the Google application programming interface (API), selecting street-view images using a python-based code developed for this project. The new dataset was evaluated using two mainstream deep learning frameworks: You Only Look Once (YOLO v2) and Faster Region Convolution Neural Network (Faster R-CNN). Accuracy scores using the F1 index were found to be 0.84 for YOLOv2 and 0.65 for the Faster R-CNN model runs; both quite acceptable considering the convenience of utilizing Google Maps images.
APA, Harvard, Vancouver, ISO, and other styles
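The F1 index used above to score YOLO v2 and Faster R-CNN is the harmonic mean of precision and recall over detections. A one-liner sketch (the counts in the example are hypothetical, not from the paper):

```python
def f1_score(tp, fp, fn):
    """F1 index from true-positive, false-positive and false-negative counts:
    the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With 8 correct detections, 2 spurious ones and 2 misses, both precision and recall are 0.8, so the F1 index is 0.8.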
16

Amjad, Muhammad. "The Value of Manifold Learning Algorithms in Simplifying Complex Datasets for More Efficacious Analysis." Sciential - McMaster Undergraduate Science Journal, no. 5 (December 4, 2020): 13–20. http://dx.doi.org/10.15173/sciential.v1i5.2537.

Full text
Abstract:
Advances in manifold learning have proven to be of great benefit in reducing the dimensionality of large complex datasets. Elements in an intricate dataset will typically belong in high-dimensional space as the number of individual features or independent variables will be extensive. However, these elements can be integrated into a low-dimensional manifold with well-defined parameters. By constructing a low-dimensional manifold and embedding it into high-dimensional feature space, the dataset can be simplified for easier interpretation. In spite of this elemental dimensionality reduction, the dataset’s constituents do not lose any information, but rather filter it with the hopes of elucidating the appropriate knowledge. This paper will explore the importance of this method of data analysis, its applications, and its extensions into topological data analysis.
APA, Harvard, Vancouver, ISO, and other styles
17

Morgan, Maria, Carla Blank, and Raed Seetan. "Plant disease prediction using classification algorithms." IAES International Journal of Artificial Intelligence (IJ-AI) 10, no. 1 (March 1, 2021): 257. http://dx.doi.org/10.11591/ijai.v10.i1.pp257-264.

Full text
Abstract:
This paper investigates the capability of six existing classification algorithms (Artificial Neural Network, Naïve Bayes, k-Nearest Neighbor, Support Vector Machine, Decision Tree and Random Forest) in classifying and predicting diseases in soybean and mushroom datasets using datasets with numerical or categorical attributes. While many similar studies have been conducted on datasets of images to predict plant diseases, the main objective of this study is to suggest classification methods that can be used for disease classification and prediction in datasets that contain raw measurements instead of images. A fungus and a plant dataset, which had many differences, were chosen so that the findings in this paper could be applied to future research for disease prediction and classification in a variety of datasets which contain raw measurements. A key difference between the two datasets, other than one being a fungus and one being a plant, is that the mushroom dataset is balanced and only contained two classes while the soybean dataset is imbalanced and contained eighteen classes. All six algorithms performed well on the mushroom dataset, while the Artificial Neural Network and k-Nearest Neighbor algorithms performed best on the soybean dataset. The findings of this paper can be applied to future research on disease classification and prediction in a variety of dataset types such as fungi, plants, humans, and animals.
APA, Harvard, Vancouver, ISO, and other styles
18

Williamson, Sinead A., and Jette Henderson. "Understanding Collections of Related Datasets Using Dependent MMD Coresets." Information 12, no. 10 (September 23, 2021): 392. http://dx.doi.org/10.3390/info12100392.

Full text
Abstract:
Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful for understanding multiple related datasets and understanding model generalization between such datasets.
APA, Harvard, Vancouver, ISO, and other styles
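The MMD underlying these coresets measures the distance between two datasets' kernel mean embeddings. A minimal sketch for scalar data with an RBF kernel (this is the biased V-statistic form, and the helper names are illustrative, not code from the paper):

```python
import math

def mmd_squared(X, Y, gamma=1.0):
    """Squared Maximum Mean Discrepancy between two 1-D samples under an
    RBF kernel k(a, b) = exp(-gamma * (a - b)^2), biased V-statistic form."""
    def k(a, b):
        return math.exp(-gamma * (a - b) ** 2)
    def mean_kernel(A, B):
        return sum(k(a, b) for a in A for b in B) / (len(A) * len(B))
    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2 * mean_kernel(X, Y)
```

An MMD coreset is a small weighted point set chosen so that its MMD to the full dataset is small; the dependent coresets in the paper share points across datasets so that such comparisons stay interpretable.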
19

Kramberger, Tin, and Božidar Potočnik. "LSUN-Stanford Car Dataset: Enhancing Large-Scale Car Image Datasets Using Deep Learning for Usage in GAN Training." Applied Sciences 10, no. 14 (July 17, 2020): 4913. http://dx.doi.org/10.3390/app10144913.

Full text
Abstract:
Currently there is no publicly available adequate dataset that could be used for training Generative Adversarial Networks (GANs) on car images. All available car datasets differ in noise, pose, and zoom levels. Thus, the objective of this work was to create an improved car image dataset that would be better suited for GAN training. To improve the performance of the GAN, we coupled the LSUN and Stanford car datasets. A new merged dataset was then pruned in order to adjust zoom levels and reduce the noise of images. This process resulted in fewer images that could be used for training, with increased quality though. This pruned dataset was evaluated by training the StyleGAN with original settings. Pruning the combined LSUN and Stanford datasets resulted in 2,067,710 images of cars with less noise and more adjusted zoom levels. The training of the StyleGAN on the LSUN-Stanford car dataset proved to be superior to the training with just the LSUN dataset by 3.7% using the Fréchet Inception Distance (FID) as a metric. Results pointed out that the proposed LSUN-Stanford car dataset is more consistent and better suited for training GAN neural networks than other currently available large car datasets.
APA, Harvard, Vancouver, ISO, and other styles
20

Alade, Oyekale Abel, Ali Selamat, and Roselina Sallehuddin. "The Effects of Missing Data Characteristics on the Choice of Imputation Techniques." Vietnam Journal of Computer Science 07, no. 02 (March 20, 2020): 161–77. http://dx.doi.org/10.1142/s2196888820500098.

Full text
Abstract:
One major characteristic of data is completeness. Missing data is a significant problem in medical datasets. It leads to incorrect classification of patients and is dangerous to the health management of patients. Many factors lead to missing values in medical databases. In this paper, we propose examining the causes of missing data in a medical dataset to ensure that the right imputation method is used to solve the problem. The mechanism of missingness was studied to determine the missing pattern of the datasets and select a suitable imputation technique to generate complete datasets. The pattern shows that the missingness of the dataset used in this study is not a monotone missing pattern. Also, single imputation techniques underestimate variance and ignore relationships among the variables; therefore, we used a multiple imputation technique that runs five iterations for the imputation of each missing value. All missing values in the dataset were regenerated. The imputed datasets were validated using an extreme learning machine (ELM) classifier. The results show improvement in the accuracy of the imputed datasets. The work can, however, be extended to compare the accuracy of the imputed datasets with the original dataset using different classifiers such as support vector machines (SVM), radial basis functions (RBF), and ELMs.
APA, Harvard, Vancouver, ISO, and other styles
21

Wu, Qiaoyan, and Yilei Wang. "Comparison of Oceanic Multisatellite Precipitation Data from Tropical Rainfall Measurement Mission and Global Precipitation Measurement Mission Datasets with Rain Gauge Data from Ocean Buoys." Journal of Atmospheric and Oceanic Technology 36, no. 5 (May 2019): 903–20. http://dx.doi.org/10.1175/jtech-d-18-0152.1.

Full text
Abstract:
Three satellite-derived precipitation datasets [the Tropical Rainfall Measuring Mission Multisatellite Precipitation Analysis (TMPA) dataset, the NOAA Climate Prediction Center morphing technique (CMORPH) dataset, and the newly available Integrated Multisatellite Retrievals for Global Precipitation Measurement (IMERG) dataset] are compared with data obtained from 55 rain gauges mounted on floating buoys in the tropics for the period 1 April 2014–30 April 2017. All three satellite datasets underestimate low rainfall and overestimate high rainfall in the tropical Pacific Ocean, but the TMPA dataset does this the most. In the high-rainfall (higher than 4 mm day⁻¹) Atlantic region, all three satellite datasets overestimate low rainfall and underestimate high rainfall, but the IMERG dataset does this the most. For the Indian Ocean, all three rainfall satellite datasets overestimate rainfall at some gauges and underestimate it at others. Of these three satellite products, IMERG is the most accurate in estimating mean precipitation over the tropical Pacific and Indian Oceans, but it is less accurate over the tropical Atlantic Ocean for regions of high rainfall. The differences between the three satellite datasets vary by region, and there is a need to consider uncertainties in the data before using them for research.
APA, Harvard, Vancouver, ISO, and other styles
22

Khakwani, Aamir, Ruth H. Jack, Sally Vernon, Rosie Dickinson, Natasha Wood, Susan Harden, Paul Beckett, Ian Woolhouse, and Richard B. Hubbard. "Apples and pears? A comparison of two sources of national lung cancer audit data in England." ERJ Open Research 3, no. 3 (July 2017): 00003–2017. http://dx.doi.org/10.1183/23120541.00003-2017.

Full text
Abstract:
In 2014, the method of data collection from NHS trusts in England for the National Lung Cancer Audit (NLCA) was changed from a bespoke dataset called LUCADA (Lung Cancer Data). Under the new contract, data are submitted via the Cancer Outcome and Service Dataset (COSD) system and linked additional cancer registry datasets. In 2014, trusts were given the opportunity to submit LUCADA data as well as registry data: 132 NHS trusts submitted LUCADA data, and all 151 trusts submitted COSD data. This transitional year therefore provided the opportunity to compare both datasets for data completeness and reliability. We linked the two datasets at the patient level to assess the completeness of key patient and treatment variables. We also assessed the inter-dataset agreement of these variables using Cohen's kappa statistic, κ. We identified 26 001 patients in both datasets. Overall, the recording of sex, age, performance status, and stage had more than 90% agreement between the datasets, but there were more patients with missing performance status in the registry dataset. Although levels of agreement for surgery, chemotherapy, and external-beam radiotherapy were high between the datasets, the new COSD system identified more instances of active treatment. There seems to be high agreement between the datasets, and the findings suggest that the registry dataset coupled with COSD provides a richer dataset than LUCADA. It lagged behind LUCADA in performance status recording, however, which needs to improve over time.
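Inter-dataset agreement of the kind reported can be quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch for two equal-length label sequences (illustrative, not the audit's actual code):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal label frequencies of each rater.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / n**2
    return (observed - expected) / (1 - expected)
```

Perfect agreement gives κ = 1, while agreement no better than chance gives κ = 0.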
APA, Harvard, Vancouver, ISO, and other styles
23

Ferenc, Rudolf, Zoltán Tóth, Gergely Ladányi, István Siket, and Tibor Gyimóthy. "A public unified bug dataset for java and its assessment regarding metrics and bug prediction." Software Quality Journal 28, no. 4 (June 3, 2020): 1447–506. http://dx.doi.org/10.1007/s11219-020-09515-0.

Full text
Abstract:
Bug datasets have been created and used by many researchers to build and validate novel bug prediction models. In this work, our aim is to collect existing public source code metric-based bug datasets and unify their contents. Furthermore, we wish to assess the plethora of collected metrics and the capabilities of the unified bug dataset in bug prediction. We considered 5 public datasets, downloaded the corresponding source code for each system in the datasets, and performed source code analysis to obtain a common set of source code metrics. This way, we produced a unified bug dataset at both class and file level. We investigated the divergence of metric definitions and values across the different bug datasets. Finally, we used a decision tree algorithm to show the capabilities of the dataset in bug prediction. We found that there are statistically significant differences between the values of the original and the newly calculated metrics; furthermore, notations and definitions can differ severely. We compared the bug prediction capabilities of the original and the extended metric suites (within-project learning). Afterwards, we merged all classes (and files) into one large dataset consisting of 47,618 elements (43,744 for files) and evaluated the bug prediction model built on this large dataset as well. Finally, we also investigated the cross-project capabilities of the bug prediction models and datasets. We made the unified dataset publicly available for everyone. By using a public unified dataset as input for different bug prediction related investigations, researchers can make their studies reproducible, and thus able to be validated and verified.
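The paper's decision tree experiments depend on its tooling, but the core idea of learning a bug-proneness split from source code metric values can be illustrated with a single-feature decision stump (a hypothetical simplification, not the authors' algorithm):

```python
import numpy as np

def fit_stump(X, y):
    """Return (feature, threshold, accuracy) of the best single-metric
    split of X (n_samples, n_metrics) against binary buggy labels y."""
    best = (0, -np.inf, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            # Try both orientations of the split.
            acc = max((pred == y).mean(), ((1 - pred) == y).mean())
            if acc > best[2]:
                best = (j, t, acc)
    return best
```

On metrics that separate buggy from clean elements, the stump recovers a perfectly discriminating threshold; a full decision tree applies this search recursively.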
APA, Harvard, Vancouver, ISO, and other styles
24

Tarek, Mostafa, François P. Brissette, and Richard Arsenault. "Large-Scale Analysis of Global Gridded Precipitation and Temperature Datasets for Climate Change Impact Studies." Journal of Hydrometeorology 21, no. 11 (November 2020): 2623–40. http://dx.doi.org/10.1175/jhm-d-20-0100.1.

Full text
Abstract:
Currently, there are a large number of diverse climate datasets in existence, which differ, sometimes greatly, in terms of their data sources, quality control schemes, estimation procedures, and spatial and temporal resolutions. Choosing an appropriate dataset for a given application is therefore not a simple task. This study compares nine global/near-global precipitation datasets and three global temperature datasets over 3138 North American catchments. The chosen datasets all meet the minimum requirement of having at least 30 years of available data, so they could all potentially be used as reference datasets for climate change impact studies. The precipitation datasets include two gauged-only products (GPCC and CPC-Unified), two satellite products corrected using ground-based observations (CHIRPS V2.0 and PERSIANN-CDR V1R1), four reanalysis products (NCEP CFSR, JRA55, ERA-Interim, and ERA5), and one merged product (MSWEP V1.2). The temperature datasets include one gauge-based (CPC-Unified) and two reanalysis (ERA-Interim and ERA5) products. High-resolution gauge-based gridded precipitation and temperature datasets were combined as the reference dataset for this intercomparison study. To assess dataset performance, all combinations were used as inputs to a lumped hydrological model. The results showed that all temperature datasets performed similarly, albeit with the CPC performance being systematically inferior to that of the other three. Significant differences in performance were, however, observed between the precipitation datasets. The MSWEP dataset performed best, followed by the gauge-based, reanalysis, and satellite datasets categories. Results also showed that gauge-based datasets should be preferred in regions with good weather network density, but CHIRPS and ERA5 would be good alternatives in data-sparse regions.
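Hydrological-model performance in studies like this is commonly scored with the Nash–Sutcliffe efficiency; the abstract does not name its criterion, so NSE here is purely illustrative:

```python
import numpy as np

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 means the model
    is no better than predicting the mean of the observations."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((observed - simulated) ** 2) \
               / np.sum((observed - observed.mean()) ** 2)
```

Running the same model with each candidate precipitation dataset and comparing the resulting NSE values over many catchments is one way to rank the datasets, as done in the study.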
APA, Harvard, Vancouver, ISO, and other styles
25

Chaudhary, Archana, Savita Kolhe, and Raj Kamal. "A hybrid ensemble for classification in multiclass datasets: An application to oilseed disease dataset." Computers and Electronics in Agriculture 124 (June 2016): 65–72. http://dx.doi.org/10.1016/j.compag.2016.03.026.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Houskeeper, Henry F., and Raphael M. Kudela. "Ocean Color Quality Control Masks Contain the High Phytoplankton Fraction of Coastal Ocean Observations." Remote Sensing 11, no. 18 (September 18, 2019): 2167. http://dx.doi.org/10.3390/rs11182167.

Full text
Abstract:
Satellite estimation of oceanic chlorophyll-a content has enabled characterization of global phytoplankton stocks, but the quality of retrieval for many ocean color products (including chlorophyll-a) degrades with increasing phytoplankton biomass in eutrophic waters. Quality control of ocean color products is achieved primarily through the application of masks based on standard thresholds designed to identify suspect or low-quality retrievals. This study compares the masked and unmasked fractions of ocean color datasets from two Eastern Boundary Current upwelling ecosystems (the California and Benguela Current Systems) using satellite proxies for phytoplankton biomass that are applicable to satellite imagery without correction for atmospheric aerosols. Evaluation of the differences between the masked and unmasked fractions indicates that high biomass observations are preferentially masked in National Aeronautics and Space Administration (NASA) ocean color datasets as a result of decreased retrieval quality for waters with high concentrations of phytoplankton. This study tests whether dataset modification persists into the default composite data tier commonly disseminated to science end users. Further, this study suggests that statistics describing a dataset’s masked fraction can be helpful in assessing the quality of a composite dataset and in determining the extent to which retrieval quality is linked to biological processes in a given study region.
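The masked-fraction statistics proposed there can be computed directly from a retrieval field and its quality mask. A minimal sketch (variable names are hypothetical, not NASA's processing code):

```python
import numpy as np

def masked_stats(values, mask):
    """Compare the masked and unmasked fractions of a retrieval field.

    values: 2-D array of e.g. chlorophyll-a retrievals (NaN = no retrieval)
    mask:   boolean array, True where quality control rejects the pixel
    Returns (fraction of valid pixels masked, mean of masked pixels,
    mean of unmasked pixels).
    """
    valid = ~np.isnan(values)
    frac_masked = (mask & valid).sum() / valid.sum()
    mean_masked = np.nanmean(np.where(mask, values, np.nan))
    mean_unmasked = np.nanmean(np.where(~mask, values, np.nan))
    return frac_masked, mean_masked, mean_unmasked
```

A masked-pixel mean well above the unmasked mean would indicate the preferential masking of high-biomass observations that the study describes.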
APA, Harvard, Vancouver, ISO, and other styles
27

Cheng, Shiqiang, Cuiyan Wu, Xin Qi, Li Liu, Mei Ma, Lu Zhang, Bolun Cheng, et al. "A Large-Scale Genetic Correlation Scan Between Intelligence and Brain Imaging Phenotypes." Cerebral Cortex 30, no. 7 (February 28, 2020): 4197–203. http://dx.doi.org/10.1093/cercor/bhaa043.

Full text
Abstract:
Limited effort has been devoted until now to evaluating the potential relationships between structural and functional brain imaging and intelligence. We performed a two-stage analysis to systematically explore the relationships between 3144 brain image-derived phenotypes (IDPs) and intelligence. First, by integrating genome-wide association study (GWAS) summary data of brain IDPs and two GWAS summary datasets of intelligence, we systematically scanned the relationship between each of the 3144 brain IDPs and intelligence through linkage disequilibrium score regression (LDSC) analysis. Second, using the individual-level genotype and intelligence data of 160 124 subjects derived from UK Biobank datasets, polygenic risk scoring (PRS) analysis was performed to replicate the significant associations shared across the first stage. In the first stage, LDSC identified 6 and 2 brain IDPs significantly associated with intelligence dataset1 and dataset2, respectively. Interestingly, NET100_0624 showed genetic correlations with intelligence in both intelligence datasets. After adjusting for age and sex as covariates, NET100_0624 (P = 5.26 × 10⁻²⁰, Pearson correlation coefficient = −0.02) appeared to be associated with intelligence in the PRS analysis of UK Biobank samples. Our findings may help in understanding the genetic mechanisms by which brain structure and function affect the development of intelligence.
APA, Harvard, Vancouver, ISO, and other styles
28

Hayashi, Yoichi. "Does Deep Learning Work Well for Categorical Datasets with Mainly Nominal Attributes?" Electronics 9, no. 11 (November 21, 2020): 1966. http://dx.doi.org/10.3390/electronics9111966.

Full text
Abstract:
Given the complexity of real-world datasets, it is difficult to represent data structures using existing deep learning (DL) models. Most research to date has concentrated on datasets with only one type of attribute: categorical or numerical. Categorical data are common in datasets such as the German (categorical) credit scoring dataset, which contains numerical, ordinal, and nominal attributes. The heterogeneous structure of this dataset makes very high accuracy difficult to achieve. DL-based methods have achieved high accuracy (99.68%) on the Wisconsin Breast Cancer Dataset, whereas DL-inspired methods have achieved high accuracy (97.39%) on the Australian credit dataset. However, to our knowledge, no such method has been proposed to classify the German credit dataset. This study aimed to provide new insights into why DL-based and DL-inspired classifiers do not work well for categorical datasets consisting mainly of nominal attributes. We also discuss the problems associated with using nominal attributes to design high-performance classifiers. Considering the expanded utility of DL, this study's findings should aid in the development of a new type of DL that can handle categorical datasets consisting mainly of nominal attributes, which are commonly used in risk evaluation, finance, banking, and marketing.
APA, Harvard, Vancouver, ISO, and other styles
29

Bajamgnigni Gbambie, Abdas Salam, Annie Poulin, Marie-Amélie Boucher, and Richard Arsenault. "Added Value of Alternative Information in Interpolated Precipitation Datasets for Hydrology." Journal of Hydrometeorology 18, no. 1 (January 1, 2017): 247–64. http://dx.doi.org/10.1175/jhm-d-16-0032.1.

Full text
Abstract:
Gridded climate datasets are produced in many parts of the world by applying various interpolation methods to weather observations, to which are sometimes added secondary information (in addition to geographic location) such as topography and radar or atmospheric model outputs. For a region of interest, the choice of a dataset for a given study can be a significant challenge given the lack of information on the similarities and differences that exist between datasets, or about the benefits that one dataset may present relative to another. This study aims to provide information on the spatial and temporal differences between gridded precipitation datasets and their implication for hydrological modeling. Three gridded datasets for the province of Quebec are considered: the Natural Resources Canada (NRCan) dataset, the Canadian Precipitation Analysis (CaPA) dataset, and the dataset from the Ministère du Développement Durable, de l’Environnement et de la Lutte contre les Changements Climatiques du Québec (MDDELCC). Using statistical metrics and diagrams, these precipitation datasets are compared with each other. Hydrological responses of 181 Quebec watersheds with respect to each gridded precipitation dataset are also analyzed using the hydrological model HSAMI. The results indicate strong similarities in the southern parts and disparities in the central and northern parts of the province of Quebec. Analysis of hydrological simulations indicates that the CaPA dataset offers the best results, particularly for watersheds located in the central and northern parts of the province. MDDELCC shows the best performance in watersheds located on the south shore of the St. Lawrence River and comes out as the overall second-best option.
APA, Harvard, Vancouver, ISO, and other styles
30

Moon, Myungjin, and Kenta Nakai. "Integrative analysis of gene expression and DNA methylation using unsupervised feature extraction for detecting candidate cancer biomarkers." Journal of Bioinformatics and Computational Biology 16, no. 02 (April 2018): 1850006. http://dx.doi.org/10.1142/s0219720018500063.

Full text
Abstract:
Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for its early diagnosis and treatment. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and the insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extraction to identify candidate cancer biomarkers from renal cell carcinoma RNA-seq datasets. The gene expression and DNA methylation datasets are normalized by the Box–Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets via unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of the top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired with the proposed analysis method. Furthermore, we expect that the proposed method can be extended to applications involving various types of multi-omics datasets.
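The normalize-then-integrate step can be sketched as a Box–Cox transform followed by projection onto the first principal component. This is a simplified stand-in for the paper's unsupervised feature extraction, with hypothetical matrix shapes (genes × samples):

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of positive data: log(x) at lambda = 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def integrate_first_pc(expression, methylation, lam=0.5):
    """Concatenate two Box-Cox-normalized, z-scored omics matrices
    (genes x samples each) and project the genes onto the first
    principal component, giving one integrated value per gene."""
    def zscore(m):
        m = box_cox(m, lam)
        return (m - m.mean(axis=0)) / m.std(axis=0)
    combined = np.hstack([zscore(expression), zscore(methylation)])
    centered = combined - combined.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]
```

The resulting one-dimensional score per gene could then be thresholded to select differentially expressed candidates, mirroring the selection step described above.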
APA, Harvard, Vancouver, ISO, and other styles
31

Đokić, Nikola, Borislava Blagojević, and Vladislava Mihailović. "Missing data representation by perception thresholds in flood flow frequency assessment." Journal of Applied Engineering Science 19, no. 2 (2021): 432–38. http://dx.doi.org/10.5937/jaes0-28902.

Full text
Abstract:
Flood flow frequency analysis (FFA) plays a key role in many fields of hydraulic engineering and water resources management. The reliability of FFA results depends on many factors, an obvious one being the reliability of the input data: datasets of annual peak flows. In practice, however, engineers often encounter incomplete datasets (missing data, data gaps, and/or broken records), which increases the uncertainty of FFA results. In this paper, we perform an at-site analysis using the complete dataset of annual peak flows from 1931 to 2016 at the hydrologic station Senta on the Tisa (Tisza) river as the reference dataset. From this original dataset we remove some data, obtaining 15 new datasets with one continuous gap of varying length and/or location. We then subject each dataset to FFA using the USACE HEC-SSP Bulletin 17C analysis, in which perception thresholds represent the missing data. By varying the lower bound of the perception threshold for all missing flows in a dataset, we create 56 variants of the input HEC-SSP datasets. The flood flow quantiles assessed from the datasets with missing data and different perception thresholds are evaluated by two uncertainty measures. The results indicate that acceptable flood quantile estimates are obtained, even for larger return periods, by setting the lower perception threshold bound at the value of the highest peak flow in the available, incomplete dataset.
APA, Harvard, Vancouver, ISO, and other styles
32

Mabuni, D., and S. Aquter Babu. "High Accurate and a Variant of k-fold Cross Validation Technique for Predicting the Decision Tree Classifier Accuracy." International Journal of Innovative Technology and Exploring Engineering 10, no. 2 (January 10, 2021): 105–10. http://dx.doi.org/10.35940/ijitee.c8403.0110321.

Full text
Abstract:
In machine learning, the data used matter more than the logic of the program. With very large and moderately sized datasets it is possible to obtain robust and high classification accuracies, but not with small and very small datasets. In particular, only large training datasets are suitable for producing robust decision tree classification results. Classification results obtained from a single training/testing dataset pair are not reliable. Cross-validation uses many random folds of the same dataset for training and validation. To obtain reliable and statistically sound classification results, the same algorithm must be applied to different pairs of training and validation datasets. To overcome the limitation of a single training dataset and a single testing dataset, the existing k-fold cross-validation technique uses a cross-validation plan to obtain improved decision tree classification accuracy. In this paper, a new cross-validation technique called prime fold is proposed; it is tested experimentally and verified on many benchmark UCI machine learning datasets. The prime-fold-based decision tree classification accuracies obtained in the experiments are far better than those of existing techniques.
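Standard k-fold cross-validation, the baseline the prime fold technique is compared against, can be sketched as follows (illustrative only; the abstract does not specify the prime fold variant in enough detail to reproduce it):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n sample indices and yield (train, validation) index
    arrays for each of the k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```

Each sample appears in exactly one validation fold, so averaging the per-fold accuracies uses every observation for both training and validation.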
APA, Harvard, Vancouver, ISO, and other styles
33

Di, Yanghua, Zhiguo Jiang, and Haopeng Zhang. "A Public Dataset for Fine-Grained Ship Classification in Optical Remote Sensing Images." Remote Sensing 13, no. 4 (February 18, 2021): 747. http://dx.doi.org/10.3390/rs13040747.

Full text
Abstract:
Fine-grained visual categorization (FGVC) is an important and challenging problem due to large intra-class differences and small inter-class differences caused by deformation, illumination, angles, etc. Although major advances have been achieved on natural images in the past few years thanks to the release of popular datasets such as CUB-200-2011, Stanford Cars, and Aircraft, fine-grained ship classification in remote sensing images has rarely been studied because of the relative scarcity of publicly available datasets. In this paper, we investigate a large amount of remote sensing imagery of sea ships and determine the 42 most common categories for fine-grained visual categorization. Building on our previous DSCR dataset, a dataset for ship classification in remote sensing images, we collect additional remote sensing images containing warships and civilian ships of various scales from Google Earth and other popular remote sensing image datasets, including DOTA, HRSC2016, and NWPU VHR-10. We call our dataset FGSCR-42, meaning a dataset for Fine-Grained Ship Classification in Remote sensing images with 42 categories. The whole FGSCR-42 dataset contains 9320 images of the most common ship types. We evaluate popular object classification and fine-grained visual categorization algorithms to build a benchmark. Our FGSCR-42 dataset is publicly available at our webpages.
APA, Harvard, Vancouver, ISO, and other styles
34

Dlamini, Nkosikhona, and Terence L. van Zyl. "Comparing Class-Aware and Pairwise Loss Functions for Deep Metric Learning in Wildlife Re-Identification." Sensors 21, no. 18 (September 12, 2021): 6109. http://dx.doi.org/10.3390/s21186109.

Full text
Abstract:
Similarity learning using deep convolutional neural networks has been applied extensively to computer vision problems, an attraction supported by its success in one-shot and zero-shot classification applications. Advances in similarity learning are essential for smaller datasets, or datasets in which few labelled examples exist per class, such as wildlife re-identification. Improving the performance of similarity learning models comes with developing new sampling techniques and designing loss functions better suited to training similarity in neural networks. However, the impact of these advances is usually tested on larger datasets, with limited attention given to smaller, imbalanced datasets such as those found in unique wildlife re-identification. To this end, we test the advances in loss functions for similarity learning on several animal re-identification tasks. We add two new public datasets, Nyala and Lions, to the challenge of animal re-identification. Our results are state of the art on all public datasets tested except Pandas. The achieved Top-1 Recall is 94.8% on the Zebra dataset, 72.3% on the Nyala dataset, 79.7% on the Chimps dataset and, on the Tiger dataset, 88.9%. For the Lion dataset, we set a new benchmark at 94.8%. We find that the best performing loss function across all datasets is generally the triplet loss; however, there is only a marginal improvement compared with the performance achieved by Proxy-NCA models. We demonstrate that no single neural network architecture combined with a loss function is best suited to all datasets, although VGG-11 may be the most robust first choice. Our results highlight the need for broader experimentation and exploration of loss functions and neural network architectures for the more challenging task, relative to classical benchmarks, of wildlife re-identification.
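The triplet loss found to perform best can be sketched for batches of embeddings; this minimal numpy illustration is not the paper's training code, only the loss itself:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on embedding batches of shape (n, dim):
    pull each anchor toward its positive and push it at least `margin`
    farther from its negative."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

The loss is zero once every negative is at least `margin` farther from the anchor than the positive, which is what drives re-identification embeddings apart by identity.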
APA, Harvard, Vancouver, ISO, and other styles
35

Sarma, Karthik V., Alex G. Raman, Nikhil J. Dhinagar, Alan M. Priester, Stephanie Harmon, Thomas Sanford, Sherif Mehralivand, et al. "Harnessing clinical annotations to improve deep learning performance in prostate segmentation." PLOS ONE 16, no. 6 (June 25, 2021): e0253829. http://dx.doi.org/10.1371/journal.pone.0253829.

Full text
Abstract:
Purpose: Developing large-scale datasets with research-quality annotations is challenging due to the high cost of refining clinically generated markup into high-precision annotations. We evaluated the direct use of a large dataset with only clinically generated annotations in the development of high-performance segmentation models for small research-quality challenge datasets.
Materials and methods: We used a large retrospective dataset from our institution comprising 1,620 clinically generated segmentations, and two challenge datasets (PROMISE12: 50 patients; ProstateX-2: 99 patients). We trained a 3D U-Net convolutional neural network (CNN) segmentation model using our entire dataset and used that model as a template to train models on the challenge datasets. We also trained versions of the template model using ablated proportions of our dataset and evaluated the relative benefit of those templates for the final models. Finally, we trained a version of the template model using an out-of-domain brain cancer dataset and evaluated the relative benefit of that template for the final models. We used five-fold cross-validation (CV) for all training and evaluation across our entire dataset.
Results: Our model achieves state-of-the-art performance on our large dataset (mean overall Dice 0.916, average Hausdorff distance 0.135 across CV folds). Using this model as a pre-trained template for refining on two external datasets significantly enhanced performance (30% and 49% improvements in Dice scores, respectively). Mean overall Dice and mean average Hausdorff distance were 0.912 and 0.15 for the ProstateX-2 dataset, and 0.852 and 0.581 for the PROMISE12 dataset. Using even small quantities of data to train the template enhanced performance, with significant improvements using 5% or more of the data.
Conclusion: We trained a state-of-the-art model using unrefined clinical prostate annotations and found that its use as a template model significantly improved performance in other prostate segmentation tasks, even when trained with only 5% of the original dataset.
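The Dice score reported throughout is an overlap measure between predicted and reference segmentation masks; a minimal sketch:

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary segmentation masks:
    2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * (a & b).sum() / denom if denom else 1.0
```

Identical masks score 1.0 and disjoint masks score 0.0, so the reported means around 0.85 to 0.92 indicate substantial voxel-level agreement.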
APA, Harvard, Vancouver, ISO, and other styles
36

Madden, Frances, Jan Ashton, and Jez Cope. "Building the Picture Behind a Dataset." International Journal of Digital Curation 15, no. 1 (December 31, 2020): 9. http://dx.doi.org/10.2218/ijdc.v15i1.702.

Full text
Abstract:
As part of the European Commission-funded FREYA project, the British Library wanted to explore the possibility of developing provenance information for datasets derived from the British Library's collections, the data.bl.uk collection. Provenance information is defined in this context as 'information relating to the origin, source and curation of the datasets'. Provenance information is also identified within the FAIR principles as an important aspect of being able to reuse and understand research datasets. According to the FAIR principles, the aim is to understand how to cite and acknowledge a dataset as well as how the dataset was created and processed; the principles also stress the importance of this metadata being machine readable. By enhancing the metadata of these datasets with additional persistent identifiers and metadata, a fuller picture of the datasets and their content could be understood. This also adds to the veracity and understanding of the datasets by end users of data.bl.uk.
APA, Harvard, Vancouver, ISO, and other styles
37

Archila Bustos, Maria Francisca, Ola Hall, Thomas Niedomysl, and Ulf Ernstson. "A pixel level evaluation of five multitemporal global gridded population datasets: a case study in Sweden, 1990–2015." Population and Environment 42, no. 2 (September 1, 2020): 255–77. http://dx.doi.org/10.1007/s11111-020-00360-8.

Full text
Abstract:
Human activity is a major driver of change and has contributed to many of the challenges we face today. Detailed information about human population distribution is fundamental and use of freely available, high-resolution, gridded datasets on global population as a source of such information is increasing. However, there is little research to guide users in dataset choice. This study evaluates five of the most commonly used global gridded population datasets against a high-resolution Swedish population dataset on a pixel level. We show that datasets which employ more complex modeling techniques exhibit lower errors overall but no one dataset performs best under all situations. Furthermore, differences exist in how unpopulated areas are identified and changes in algorithms over time affect accuracy. Our results provide guidance in navigating the differences between the most commonly used gridded population datasets and will help researchers and policy makers identify the most suitable datasets under varying conditions.
APA, Harvard, Vancouver, ISO, and other styles
38

Shi, Lingfei, and Feng Ling. "Local Climate Zone Mapping Using Multi-Source Free Available Datasets on Google Earth Engine Platform." Land 10, no. 5 (April 23, 2021): 454. http://dx.doi.org/10.3390/land10050454.

Full text
Abstract:
As one of the most widely studied urban climate issues, the urban heat island (UHI) has been investigated using the local climate zone (LCZ) classification scheme in recent years. More and more effort has been focused on improving LCZ mapping accuracy, and taking advantage of multi-source images in LCZ mapping has become a prevalent trend. To this end, this paper utilized multi-source, freely available datasets: the Sentinel-2 multispectral instrument (MSI), Sentinel-1 synthetic aperture radar (SAR), Luojia1-01 nighttime light (NTL), and Open Street Map (OSM) datasets, to produce a 10 m LCZ classification using the Google Earth Engine (GEE) platform. Derived datasets of the Sentinel-2 MSI data, such as spectral index (SI) and gray-level co-occurrence matrix (GLCM) datasets, were also exploited in LCZ classification. Different dataset combinations were designed to evaluate each dataset's contribution to LCZ classification. It was found that: (1) the synergistic use of Sentinel-2 MSI and Sentinel-1 SAR data can improve the accuracy of LCZ classification; (2) the multi-seasonal information of the Sentinel data also contributes to LCZ classification; (3) the OSM, GLCM, SI, and NTL datasets each make some positive contribution when added individually to the seasonal Sentinel-1 and Sentinel-2 datasets; (4) combining as many datasets as possible is not necessarily the right way to improve LCZ classification accuracy. With the help of GEE, this study offers the potential to generate more accurate LCZ maps on a large scale, which is significant for urban development.
APA, Harvard, Vancouver, ISO, and other styles
39

Tran, Thi-Dung, Junghee Kim, Ngoc-Huynh Ho, Hyung-Jeong Yang, Sudarshan Pant, Soo-Hyung Kim, and Guee-Sang Lee. "Stress Analysis with Dimensions of Valence and Arousal in the Wild." Applied Sciences 11, no. 11 (June 3, 2021): 5194. http://dx.doi.org/10.3390/app11115194.

Full text
Abstract:
In the field of stress recognition, the majority of research has conducted experiments on datasets collected from controlled environments with limited stressors. As these datasets cannot represent real-world scenarios, stress identification and analysis are difficult. There is a dire need for reliable, large datasets that are specifically acquired for stress emotion with varying degrees of expression for this task. In this paper, we introduced a dataset for Stress Analysis with Dimensions of Valence and Arousal of Korean Movie in Wild (SADVAW), which includes video clips with diversity in facial expressions from different Korean movies. The SADVAW dataset contains continuous dimensions of valence and arousal. We presented a detailed statistical analysis of the dataset. We also analyzed the correlation between stress and continuous dimensions. Moreover, using the SADVAW dataset, we trained a deep learning-based model for stress recognition.
APA, Harvard, Vancouver, ISO, and other styles
40

Shaon, Arif, Sarah Callaghan, Bryan Lawrence, Brian Matthews, Timothy Osborn, Colin Harpham, and Andrew Woolf. "Opening Up Climate Research: A Linked Data Approach to Publishing Data Provenance." International Journal of Digital Curation 7, no. 1 (March 12, 2012): 163–73. http://dx.doi.org/10.2218/ijdc.v7i1.223.

Full text
Abstract:
Traditionally, the formal scientific output in most fields of natural science has been limited to peer-reviewed academic journal publications, with less attention paid to the chain of intermediate data results and their associated metadata, including provenance. In effect, this has constrained the representation and verification of the data provenance to the confines of the related publications. Detailed knowledge of a dataset’s provenance is essential to establish the pedigree of the data for its effective re-use, and to avoid redundant re-enactment of the experiment or computation involved. It is increasingly important for open-access data to determine their authenticity and quality, especially considering the growing volumes of datasets appearing in the public domain. To address these issues, we present an approach that combines the Digital Object Identifier (DOI) – a widely adopted citation technique – with existing, widely adopted climate science data standards to formally publish detailed provenance of a climate research dataset as an associated scientific workflow. This is integrated with linked-data compliant data re-use standards (e.g. OAI-ORE) to enable a seamless link between a publication and the complete trail of lineage of the corresponding dataset, including the dataset itself.
APA, Harvard, Vancouver, ISO, and other styles
41

Xie, Ning-Ning, Fang-Fang Wang, Jue Zhou, Chang Liu, and Fan Qu. "Establishment and Analysis of a Combined Diagnostic Model of Polycystic Ovary Syndrome with Random Forest and Artificial Neural Network." BioMed Research International 2020 (August 20, 2020): 1–13. http://dx.doi.org/10.1155/2020/2613091.

Full text
Abstract:
Polycystic ovary syndrome (PCOS) is one of the most common metabolic and reproductive endocrinopathies. However, few studies have tried to develop a diagnostic model based on gene biomarkers. In this study, we applied a computational method by combining two machine learning algorithms, including random forest (RF) and artificial neural network (ANN), to identify gene biomarkers and construct diagnostic model. We collected gene expression data from Gene Expression Omnibus (GEO) database containing 76 PCOS samples and 57 normal samples; five datasets were utilized, including one dataset for screening differentially expressed genes (DEGs), two training datasets, and two validation datasets. Firstly, based on RF, 12 key genes in 264 DEGs were identified to be vital for classification of PCOS and normal samples. Moreover, the weights of these key genes were calculated using ANN with microarray and RNA-seq training dataset, respectively. Furthermore, the diagnostic models for two types of datasets were developed and named neuralPCOS. Finally, two validation datasets were used to test and compare the performance of neuralPCOS with other two set of marker genes by area under curve (AUC). Our model achieved an AUC of 0.7273 in microarray dataset, and 0.6488 in RNA-seq dataset. To conclude, we uncovered gene biomarkers and developed a novel diagnostic model of PCOS, which would be helpful for diagnosis.
APA, Harvard, Vancouver, ISO, and other styles
42

Wang, Xiaoqing, Xiangjun Wang, and Yubo Ni. "Unsupervised Domain Adaptation for Facial Expression Recognition Using Generative Adversarial Networks." Computational Intelligence and Neuroscience 2018 (July 9, 2018): 1–10. http://dx.doi.org/10.1155/2018/7208794.

Full text
Abstract:
In the facial expression recognition task, a good-performing convolutional neural network (CNN) model trained on one dataset (source dataset) usually performs poorly on another dataset (target dataset). This is because the feature distribution of the same emotion varies in different datasets. To improve the cross-dataset accuracy of the CNN model, we introduce an unsupervised domain adaptation method, which is especially suitable for unlabelled small target dataset. In order to solve the problem of lack of samples from the target dataset, we train a generative adversarial network (GAN) on the target dataset and use the GAN generated samples to fine-tune the model pretrained on the source dataset. In the process of fine-tuning, we give the unlabelled GAN generated samples distributed pseudolabels dynamically according to the current prediction probabilities. Our method can be easily applied to any existing convolutional neural networks (CNN). We demonstrate the effectiveness of our method on four facial expression recognition datasets with two CNN structures and obtain inspiring results.
APA, Harvard, Vancouver, ISO, and other styles
43

Page, Roderic. "Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature." Biodiversity Data Journal 6 (July 23, 2018): e27539. http://dx.doi.org/10.3897/bdj.6.e27539.

Full text
Abstract:
Constructing a biodiversity knowledge graph will require making millions of cross links between diversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a "datasette", a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names and identifiers for the taxonomic articles that published those names.
APA, Harvard, Vancouver, ISO, and other styles
44

Vincke, S., and M. Vergauwen. "GEO-REGISTERING CONSECUTIVE DATASETS BY MEANS OF A REFERENCE DATASET, ELIMINATING GROUND CONTROL POINT INDICATION." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLII-5/W2 (September 20, 2019): 85–91. http://dx.doi.org/10.5194/isprs-archives-xlii-5-w2-85-2019.

Full text
Abstract:
The architecture, engineering and construction (AEC) industry’s interest in more advanced ways of regular monitoring of construction site activities and the achieved building progress has been rising recently. This requires frequent recordings of the area. This is only feasible if the profound observations only require limited time, both for the actual capturing on-site as well as processing of the recorded data. Moreover, for monitoring purposes, it is vital that all datasets use a single, unique reference system. This allows for an easy comparison of various observations to determine both building progress as well as possible construction deviations or errors.

In this work, a framework is proposed that facilitates a faster and more efficient way of co-registering or geo-registering consecutive datasets. It comprises three major stages, starting with the capturing of the surroundings of the construction site. By thoroughly adding numerous ground control points (GCPs) in a second phase, the processed result of this input data can be considered as a reference dataset. In a third stage, this known component is used as additional input for the processing of subsequently captured datasets. Using overlapping areas, the new observations can be immediately transferred to the correct reference system. This eliminates the indication of GCPs in subsequent datasets, which is known to be time-consuming and error-prone.

Although in this work the focus of the proposed framework lies on a photogrammetric recording approach, it also is applicable for laser scanning. Its potential is showcased on a real-world apartment construction site in Ghent, Belgium. In the test case, the presented approach is shown to be efficient, with comparable accuracies as other current methods, however, requiring less time and effort.
APA, Harvard, Vancouver, ISO, and other styles
45

Hou, Yu-Tai, Kenneth A. Campana, Kenneth E. Mitchell, Shi-Keng Yang, and Larry L. Stowe. "Comparison of an Experimental NOAA AVHRR Cloud Dataset with Other Observed and Forecast Cloud Datasets." Journal of Atmospheric and Oceanic Technology 10, no. 6 (December 1993): 833–49. http://dx.doi.org/10.1175/1520-0426(1993)010<0833:coaena>2.0.co;2.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Bolón-Canedo, V., N. Sánchez-Maroño, and A. Alonso-Betanzos. "Feature selection and classification in multiple class datasets: An application to KDD Cup 99 dataset." Expert Systems with Applications 38, no. 5 (May 2011): 5947–57. http://dx.doi.org/10.1016/j.eswa.2010.11.028.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Jittawiriyanukoon, Chanintorn. "Granularity analysis of classification and estimation for complex datasets with MOA." International Journal of Electrical and Computer Engineering (IJECE) 9, no. 1 (February 1, 2019): 409. http://dx.doi.org/10.11591/ijece.v9i1.pp409-416.

Full text
Abstract:
Dispersed and unstructured datasets are substantial parameters to realize an exact amount of the required space. Depending upon the size and the data distribution, especially, if the classes are significantly associating, the level of granularity to agree a precise classification of the datasets exceeds. The data complexity is one of the major attributes to govern the proper value of the granularity, as it has a direct impact on the performance. Dataset classification exhibits the vital step in complex data analytics and designs to ensure that dataset is prompt to be efficiently scrutinized. Data collections are always causing missing, noisy and out-of-the-range values. Data analytics which has not been wisely classified for problems as such can induce unreliable outcomes. Hence, classifications for complex data sources help comfort the accuracy of gathered datasets by machine learning algorithms. Dataset complexity and pre-processing time reflect the effectiveness of individual algorithm. Once the complexity of datasets is characterized then comparatively simpler datasets can further investigate with parallelism approach. Speedup performance is measured by the execution of MOA simulation. Our proposed classification approach outperforms and improves granularity level of complex datasets.
APA, Harvard, Vancouver, ISO, and other styles
48

Huč, Aleks, Jakob Šalej, and Mira Trebar. "Analysis of Machine Learning Algorithms for Anomaly Detection on Edge Devices." Sensors 21, no. 14 (July 20, 2021): 4946. http://dx.doi.org/10.3390/s21144946.

Full text
Abstract:
The Internet of Things (IoT) consists of small devices or a network of sensors, which permanently generate huge amounts of data. Usually, they have limited resources, either computing power or memory, which means that raw data are transferred to central systems or the cloud for analysis. Lately, the idea of moving intelligence to the IoT is becoming feasible, with machine learning (ML) moved to edge devices. The aim of this study is to provide an experimental analysis of processing a large imbalanced dataset (DS2OS), split into a training dataset (80%) and a test dataset (20%). The training dataset was reduced by randomly selecting a smaller number of samples to create new datasets Di (i = 1, 2, 5, 10, 15, 20, 40, 60, 80%). Afterwards, they were used with several machine learning algorithms to identify the size at which the performance metrics show saturation and classification results stop improving with an F1 score equal to 0.95 or higher, which happened at 20% of the training dataset. Further on, two solutions for the reduction of the number of samples to provide a balanced dataset are given. In the first, datasets DRi consist of all anomalous samples in seven classes and a reduced majority class (‘NL’) with i = 0.1, 0.2, 0.5, 1, 2, 5, 10, 15, 20 percent of randomly selected samples. In the second, datasets DCi are generated from the representative samples determined with clustering from the training dataset. All three dataset reduction methods showed comparable performance results. Further evaluation of training times and memory usage on Raspberry Pi 4 shows a possibility to run ML algorithms with limited sized datasets on edge devices.
APA, Harvard, Vancouver, ISO, and other styles
49

Guo, Rui, Yi-Qin Wang, Jin Xu, Hai-Xia Yan, Jian-Jun Yan, Fu-Feng Li, Zhao-Xia Xu, and Wen-Jie Xu. "Research on Zheng Classification Fusing Pulse Parameters in Coronary Heart Disease." Evidence-Based Complementary and Alternative Medicine 2013 (2013): 1–8. http://dx.doi.org/10.1155/2013/602672.

Full text
Abstract:
This study was conducted to illustrate that nonlinear dynamic variables of Traditional Chinese Medicine (TCM) pulse can improve the performances of TCM Zheng classification models. Pulse recordings of 334 coronary heart disease (CHD) patients and 117 normal subjects were collected in this study. Recurrence quantification analysis (RQA) was employed to acquire nonlinear dynamic variables of pulse. TCM Zheng models in CHD were constructed, and predictions using a novel multilabel learning algorithm based on different datasets were carried out. Datasets were designed as follows: dataset 1, TCM inquiry information including inspection information; dataset 2, time-domain variables of pulse and dataset 1; dataset 3, RQA variables of pulse and dataset 1; and dataset 4, major principal components of RQA variables and dataset 1. The performances of the different models for Zheng differentiation were compared. The model for Zheng differentiation based on RQA variables integrated with inquiry information had the best performance, whereas that based only on inquiry had the worst performance. Meanwhile, the model based on time-domain variables of pulse integrated with inquiry fell between the above two. This result showed that RQA variables of pulse can be used to construct models of TCM Zheng and improve the performance of Zheng differentiation models.
APA, Harvard, Vancouver, ISO, and other styles
50

Sharma, Vijeta, Manjari Gupta, Ajai Kumar, and Deepti Mishra. "EduNet: A New Video Dataset for Understanding Human Activity in the Classroom Environment." Sensors 21, no. 17 (August 24, 2021): 5699. http://dx.doi.org/10.3390/s21175699.

Full text
Abstract:
Human action recognition in videos has become a popular research area in artificial intelligence (AI) technology. In the past few years, this research has accelerated in areas such as sports, daily activities, kitchen activities, etc., due to developments in the benchmarks proposed for human action recognition datasets in these areas. However, there is little research in the benchmarking datasets for human activity recognition in educational environments. Therefore, we developed a dataset of teacher and student activities to expand the research in the education domain. This paper proposes a new dataset, called EduNet, for a novel approach towards developing human action recognition datasets in classroom environments. EduNet has 20 action classes, containing around 7851 manually annotated clips extracted from YouTube videos, and recorded in an actual classroom environment. Each action category has a minimum of 200 clips, and the total duration is approximately 12 h. To the best of our knowledge, EduNet is the first dataset specially prepared for classroom monitoring for both teacher and student activities. It is also a challenging dataset of actions as it has many clips (and due to the unconstrained nature of the clips). We compared the performance of the EduNet dataset with benchmark video datasets UCF101 and HMDB51 on a standard I3D-ResNet-50 model, which resulted in 72.3% accuracy. The development of a new benchmark dataset for the education domain will benefit future research concerning classroom monitoring systems. The EduNet dataset is a collection of classroom activities from 1 to 12 standard schools.
APA, Harvard, Vancouver, ISO, and other styles