Journal articles on the topic 'Selected subset of training data'

To see the other types of publications on this topic, follow the link: Selected subset of training data.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Selected subset of training data.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Liu, Xiao Fang, and Chun Yang. "Training Data Reduction and Classification Based on Greedy Kernel Principal Component Analysis and Fuzzy C-Means Algorithm." Applied Mechanics and Materials 347-350 (August 2013): 2390–94. http://dx.doi.org/10.4028/www.scientific.net/amm.347-350.2390.

Full text
Abstract:
Nonlinear feature extraction with the standard Kernel Principal Component Analysis (KPCA) method requires large amounts of memory and has high computational complexity on large datasets. A Greedy Kernel Principal Component Analysis (GKPCA) method is applied to reduce the training data and to handle the nonlinear feature extraction problem for large training sets in classification. First, a subset that approximates the original training data is selected from the full training data using the greedy technique of the GKPCA method. Then, the feature extraction model is trained on this subset instead of the full training data. Finally, the FCM algorithm classifies the features extracted by the GKPCA, KPCA, and PCA methods, respectively. The simulation results indicate that both the GKPCA and KPCA methods outperform the PCA method in feature extraction. In addition to retaining the performance of the KPCA method, the GKPCA method reduces computational complexity in classification owing to the reduced training set.
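
A minimal Python sketch of the train-on-a-subset idea described in this abstract. The greedy farthest-point rule below is only a stand-in for the paper's GKPCA criterion, and scikit-learn's KMeans stands in for fuzzy C-means; all sizes and parameters are illustrative.

```python
# Sketch: fit kernel PCA on a reduced training subset, then cluster the
# extracted features for all samples. The subset is chosen by a simple
# farthest-point heuristic, not the paper's exact greedy KPCA criterion.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

def greedy_subset(X, m):
    """Greedily pick m points that spread over the data (farthest-point rule)."""
    idx = [0]
    dists = np.linalg.norm(X - X[0], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dists))            # point farthest from the current subset
        idx.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                # full training data (toy stand-in)
subset = X[greedy_subset(X, 300)]              # reduced training set

kpca = KernelPCA(n_components=5, kernel="rbf", gamma=0.1).fit(subset)
features = kpca.transform(X)                   # nonlinear features for all samples
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
```

Fitting the kernel model on 300 points instead of 5,000 is what shrinks the kernel matrix, and hence the memory and training cost, while the learned transform is still applied to every sample.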
APA, Harvard, Vancouver, ISO, and other styles
2

Yu, Siwei, Jianwei Ma, and Stanley Osher. "Monte Carlo data-driven tight frame for seismic data recovery." GEOPHYSICS 81, no. 4 (July 2016): V327–V340. http://dx.doi.org/10.1190/geo2015-0343.1.

Full text
Abstract:
Seismic data denoising and interpolation are essential preprocessing steps in any seismic data processing chain. Sparse transforms with a fixed basis are often used in these two steps. Recently, we have developed an adaptive learning method, the data-driven tight frame (DDTF) method, for seismic data denoising and interpolation. With its adaptability to seismic data, the DDTF method achieves high-quality recovery. For 2D seismic data, the DDTF method is much more efficient than traditional dictionary learning methods. But for 3D or 5D seismic data, the DDTF method results in a high computational expense. The motivation behind this work is to accelerate the filter bank training process in DDTF while doing as little damage as possible to the recovery quality. The most frequently used acceleration method trains on only a randomly selected subset of the training set. However, this random selection uses no prior information about the data. We have designed a new patch selection method for DDTF seismic data recovery. We suppose that patches with higher variance contain more information related to complex structures, and should be selected into the training set with higher probability. First, we calculate the variance of all available patches. Then for each patch, a uniformly distributed random number is generated and the patch is preserved if its variance is greater than the random number. Finally, all selected patches are used for filter bank training. We call this procedure the Monte Carlo DDTF method. We have tested the trained filter bank on seismic data denoising and interpolation. Numerical results using this Monte Carlo DDTF method surpass random or regular patch selection DDTF when the sizes of the training sets are the same. We have also used state-of-the-art methods based on the curvelet transform, block matching 4D, and multichannel singular spectrum analysis as comparisons when dealing with field data.
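
The variance-based selection rule is simple enough to sketch directly in Python; the only added assumption is that patch variances are normalised by their maximum before the comparison with the uniform random numbers, which the abstract does not spell out.

```python
# Sketch of Monte Carlo patch selection: patches with higher variance are kept
# for filter-bank training with higher probability.
import numpy as np

def monte_carlo_select(patches, rng):
    """patches: (n_patches, patch_size) array. Returns the selected patches."""
    var = patches.var(axis=1)
    prob = var / var.max()                     # acceptance probability in [0, 1] (assumed scaling)
    keep = prob > rng.uniform(size=len(patches))
    return patches[keep]

rng = np.random.default_rng(0)
patches = rng.normal(size=(10000, 64))         # e.g. flattened 8x8 seismic patches (toy)
training_patches = monte_carlo_select(patches, rng)
print(len(training_patches), "of", len(patches), "patches kept for filter-bank training")
```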
APA, Harvard, Vancouver, ISO, and other styles
3

Ukil, Arijit, Leandro Marin, and Antonio J. Jara. "When less is more powerful: Shapley value attributed ablation with augmented learning for practical time series sensor data classification." PLOS ONE 17, no. 11 (November 23, 2022): e0277975. http://dx.doi.org/10.1371/journal.pone.0277975.

Full text
Abstract:
Time series sensor data classification tasks often suffer from a training data scarcity issue due to the expenses associated with expert-intervened annotation efforts. For example, Electrocardiogram (ECG) data classification for cardio-vascular disease (CVD) detection requires expensive labeling procedures with the help of cardiologists. Current state-of-the-art algorithms like deep learning models have shown outstanding performance under the general requirement that a large set of training examples is available. In this paper, we propose Shapley Attributed Ablation with Augmented Learning (ShapAAL), which demonstrates that a deep learning algorithm trained on a suitably selected subset of the seen examples, i.e., with the unimportant ones ablated from the given limited training dataset, can ensure consistently better classification performance under augmented training. In ShapAAL, additive perturbed training augments the input space to compensate for the scarcity of training examples, using a Residual Network (ResNet) architecture with perturbation-induced inputs, while Shapley attribution seeks the subset of the augmented training space with better learnability, with the goal of better general predictive performance, thanks to the “efficiency” and “null player” axioms of transferable utility games upon which the Shapley value game is formulated. In ShapAAL, the subset of training examples that contribute positively to a supervised learning setup is derived from the notion of coalition games, using Shapley values that quantify each given input’s contribution to the model prediction. ShapAAL is a novel push-pull deep architecture in which subset selection through Shapley value attribution pushes the model to a lower dimension, while augmented training augments the learning capability of the model on unseen data. We perform an ablation study to provide empirical evidence for our claim and show that the proposed ShapAAL method consistently outperforms the current baselines and state-of-the-art algorithms for time series sensor data classification tasks from the publicly available UCR time series archive, which includes practically important problems such as the detection of CVDs from ECG data.
APA, Harvard, Vancouver, ISO, and other styles
4

Hampson, Daniel P., James S. Schuelke, and John A. Quirein. "Use of multiattribute transforms to predict log properties from seismic data." GEOPHYSICS 66, no. 1 (January 2001): 220–36. http://dx.doi.org/10.1190/1.1444899.

Full text
Abstract:
We describe a new method for predicting well‐log properties from seismic data. The analysis data consist of a series of target logs from wells which tie a 3-D seismic volume. The target logs theoretically may be of any type; however, the greatest success to date has been in predicting porosity logs. From the 3-D seismic volume a series of sample‐based attributes is calculated. The objective is to derive a multiattribute transform, which is a linear or nonlinear transform between a subset of the attributes and the target log values. The selected subset is determined by a process of forward stepwise regression, which derives increasingly larger subsets of attributes. An extension of conventional crossplotting involves the use of a convolutional operator to resolve frequency differences between the target logs and the seismic data. In the linear mode, the transform consists of a series of weights derived by least‐squares minimization. In the nonlinear mode, a neural network is trained, using the selected attributes as inputs. Two types of neural networks have been evaluated: the multilayer feedforward network (MLFN) and the probabilistic neural network (PNN). Because of its mathematical simplicity, the PNN appears to be the network of choice. To estimate the reliability of the derived multiattribute transform, crossvalidation is used. In this process, each well is systematically removed from the training set, and the transform is rederived from the remaining wells. The prediction error for the hidden well is then calculated. The validation error, which is the average error for all hidden wells, is used as a measure of the likely prediction error when the transform is applied to the seismic volume. The method is applied to two real data sets. In each case, we see a continuous improvement in predictive power as we progress from single‐attribute regression to linear multiattribute prediction to neural network prediction. This improvement is evident not only on the training data but, more importantly, on the validation data. In addition, the neural network shows a significant improvement in resolution over that from linear regression.
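
A minimal sketch of the forward stepwise attribute selection with leave-one-well-out validation described above; ordinary linear regression stands in for the multiattribute transform, and the synthetic attributes, target log, and well grouping are illustrative.

```python
# Sketch: grow the attribute subset one attribute at a time, scoring each
# candidate by leave-one-well-out cross-validation error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_samples, n_attrs = 600, 12
X = rng.normal(size=(n_samples, n_attrs))          # seismic attributes at the well ties (toy)
y = 0.8 * X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.3, size=n_samples)  # target log (toy)
wells = rng.integers(0, 6, size=n_samples)         # which well each sample comes from

selected, remaining = [], list(range(n_attrs))
logo = LeaveOneGroupOut()
for _ in range(4):                                 # derive increasingly larger subsets
    scores = {}
    for a in remaining:
        cols = selected + [a]
        cv = cross_val_score(LinearRegression(), X[:, cols], y, groups=wells,
                             cv=logo, scoring="neg_mean_squared_error")
        scores[a] = cv.mean()
    best = max(scores, key=scores.get)             # attribute giving the lowest validation error
    selected.append(best)
    remaining.remove(best)
    print("subset:", selected, "validation MSE:", -scores[best])
```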
APA, Harvard, Vancouver, ISO, and other styles
5

Abuassba, Adnan O. M., Dezheng Zhang, Xiong Luo, Ahmad Shaheryar, and Hazrat Ali. "Improving Classification Performance through an Advanced Ensemble Based Heterogeneous Extreme Learning Machines." Computational Intelligence and Neuroscience 2017 (2017): 1–11. http://dx.doi.org/10.1155/2017/3405463.

Full text
Abstract:
Extreme Learning Machine (ELM) is a fast-learning algorithm for a single-hidden-layer feedforward neural network (SLFN). It often has good generalization performance. However, it may overfit the training data when it has more hidden nodes than needed. To improve generalization performance, we use a heterogeneous ensemble approach. We propose an Advanced ELM Ensemble (AELME) for classification, which includes Regularized-ELM, L2-norm-optimized ELM (ELML2), and Kernel-ELM. The ensemble is constructed by training a randomly chosen ELM classifier on a subset of training data selected through random resampling. The proposed AELM ensemble is evolved using an objective function that increases diversity and accuracy within the final ensemble. Finally, the class label of unseen data is predicted using a majority vote approach. Splitting the training data into subsets and incorporating heterogeneous ELM classifiers result in higher prediction accuracy, better generalization, and a lower number of base classifiers, compared with other models (Adaboost, Bagging, Dynamic ELM ensemble, data splitting ELM ensemble, and ELM ensemble). The validity of AELME is confirmed through classification on several real-world benchmark datasets.
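
ELM variants are not part of the common Python libraries, so the sketch below reproduces only the ensemble recipe: base learners of randomly chosen types (standard scikit-learn classifiers standing in for Regularized-ELM, ELML2, and Kernel-ELM) are trained on resampled subsets and combined by majority vote. The diversity/accuracy objective used to evolve the final ensemble is omitted.

```python
# Sketch: heterogeneous ensemble trained on bootstrap subsets, combined by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

base_types = [lambda: RidgeClassifier(alpha=1.0),                              # stand-in for Regularized-ELM
              lambda: SVC(kernel="rbf", gamma="scale"),                        # stand-in for Kernel-ELM
              lambda: MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)]   # stand-in for ELML2
rng = np.random.default_rng(0)

ensemble = []
for _ in range(15):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))              # random resampling of the training data
    clf = base_types[rng.integers(len(base_types))]()             # randomly chosen base-learner type
    ensemble.append(clf.fit(X_tr[idx], y_tr[idx]))

votes = np.array([clf.predict(X_te) for clf in ensemble])
y_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)      # majority vote
print("ensemble accuracy:", (y_pred == y_te).mean())
```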
APA, Harvard, Vancouver, ISO, and other styles
6

Lai, Feilin, and Xiaojun Yang. "Improving Land Cover Classification Over a Large Coastal City Through Stacked Generalization with Filtered Training Samples." Photogrammetric Engineering & Remote Sensing 88, no. 7 (July 1, 2022): 451–59. http://dx.doi.org/10.14358/pers.21-00035r3.

Full text
Abstract:
To improve remote sensing-based land cover mapping over heterogeneous landscapes, we developed an ensemble classifier based on stacked generalization with a new training sample refinement technique for the combiner. Specifically, a group of individual classifiers were identified and trained to derive land cover information from a satellite image covering a large complex coastal city. The mapping accuracy was quantitatively assessed with an independent reference data set, and several class probability measures were derived for each classifier. Meanwhile, various subsets were derived from the original training data set using the number of times a sample was correctly labeled by the individual classifiers as the threshold, and these subsets were further used to train a random forest model as the combiner in generating the final class predictions. While outperforming each individual classifier, the combiner performed better when using the class probabilities rather than the class predictions as the meta-feature layers, and performed significantly better when trained with a carefully selected subset rather than with the entire sample set. The novelties of this work are the insight into the impact of different training sample subsets on the performance of stacked generalization and the filtering technique developed to prepare training samples for the combiner, which leads to a large accuracy improvement.
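
A minimal sketch of stacking with a filtered combiner training set, assuming that "times correctly labeled" is counted on the training data and thresholded; the base models, the threshold of two, and the synthetic data are all illustrative.

```python
# Sketch: train base classifiers, keep only training samples that enough base
# classifiers label correctly, and train a random forest combiner on the
# class-probability meta-features of the kept samples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=15, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

bases = [LogisticRegression(max_iter=1000), KNeighborsClassifier(), SVC(probability=True)]
for b in bases:
    b.fit(X_tr, y_tr)

meta = np.hstack([b.predict_proba(X_tr) for b in bases])      # class probabilities as meta-features
correct = sum(b.predict(X_tr) == y_tr for b in bases)         # 0..3 correct votes per sample
keep = correct >= 2                                           # filtering threshold (illustrative)

combiner = RandomForestClassifier(random_state=0).fit(meta[keep], y_tr[keep])
meta_te = np.hstack([b.predict_proba(X_te) for b in bases])
print("stacked accuracy:", combiner.score(meta_te, y_te))
```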
APA, Harvard, Vancouver, ISO, and other styles
7

Hao, Ruqian, Lin Liu, Jing Zhang, Xiangzhou Wang, Juanxiu Liu, Xiaohui Du, Wen He, Jicheng Liao, Lu Liu, and Yuanying Mao. "A Data-Efficient Framework for the Identification of Vaginitis Based on Deep Learning." Journal of Healthcare Engineering 2022 (February 27, 2022): 1–11. http://dx.doi.org/10.1155/2022/1929371.

Full text
Abstract:
Vaginitis is a gynecological disease affecting the health of millions of women all over the world. The traditional diagnosis of vaginitis is based on manual microscopy, which is time-consuming and tedious. Deep learning offers a fast and reliable solution for automatic early diagnosis of vaginitis. However, deep neural networks require massive amounts of well-annotated data. Manual annotation of microscopic images is highly cost-intensive because it is not only a time-consuming process but also requires highly trained people (doctors, pathologists, or technicians). Most existing active learning approaches are not applicable to microscopic images due to their complex backgrounds and numerous formed elements. To address the high cost of labeling microscopic images, we present a data-efficient framework for the identification of vaginitis based on transfer learning and active learning strategies. The proposed informative sample selection strategy selects a minimal training subset, and the pretrained convolutional neural network (CNN) is then fine-tuned on the selected subset. The experimental results show that the proposed pipeline can save 37.5% of the annotation cost while maintaining competitive performance. The proposed framework can significantly reduce annotation cost and has the potential to extend widely to other microscopic imaging applications, such as blood microscopic image analysis.
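
The informative-sample-selection idea can be sketched as a generic pool-based active learning loop with least-confidence sampling; a logistic regression stands in for the pretrained CNN being fine-tuned, and the seed size, batch size, and labeling budget are illustrative assumptions.

```python
# Sketch: start from a small labeled seed set, repeatedly "annotate" the pool
# samples the current model is least confident about, and retrain.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=50, replace=False))     # initial seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for _ in range(10):                                            # 10 query rounds of 25 samples
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)                      # least-confidence score
    query = np.argsort(uncertainty)[-25:]                      # most informative samples
    newly = [unlabeled[i] for i in query]
    labeled += newly                                           # send these to the annotator
    newly_set = set(newly)
    unlabeled = [i for i in unlabeled if i not in newly_set]

print("final model trained on", len(labeled), "of", len(X), "samples")
```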
APA, Harvard, Vancouver, ISO, and other styles
8

Yao, Yu Kai, Yang Liu, Zhao Li, and Xiao Yun Chen. "An Effective K-Means Clustering Based SVM Algorithm." Applied Mechanics and Materials 333-335 (July 2013): 1344–48. http://dx.doi.org/10.4028/www.scientific.net/amm.333-335.1344.

Full text
Abstract:
Support Vector Machine (SVM) is one of the most popular and effective data mining algorithms for classification and regression problems, and it has attracted much attention in recent years. SVM finds the optimal separating hyperplane between classes, which gives it outstanding generalization ability. Usually all the labeled records are used as the training set. However, because the optimal separating hyperplane depends only on a few crucial samples (support vectors, SVs), we need not train the SVM model on the whole training set. In this paper a novel SVM model based on K-means clustering is presented, in which only a small subset of the original training set is selected to constitute the final training set, and the SVM classifier is built by training on these selected samples. This greatly decreases the scale of the training set and effectively saves the training and prediction cost of SVM, while preserving its generalization performance.
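
A minimal sketch of K-means-based training-set reduction before SVM training. Keeping, within each class, the samples farthest from their cluster centroid (as rough boundary candidates) is an assumption made here for illustration, not necessarily the paper's exact selection rule.

```python
# Sketch: cluster each class, keep a small fraction of samples per cluster,
# and train the SVM only on this reduced set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_tr, y_tr, X_te, y_te = X[:3000], y[:3000], X[3000:], y[3000:]

keep_idx = []
for cls in np.unique(y_tr):
    cls_idx = np.where(y_tr == cls)[0]
    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_tr[cls_idx])
    dist = np.linalg.norm(X_tr[cls_idx] - km.cluster_centers_[km.labels_], axis=1)
    for c in range(10):
        members = cls_idx[km.labels_ == c]
        order = np.argsort(dist[km.labels_ == c])                        # nearest -> farthest
        keep_idx.extend(members[order[-max(1, len(members) // 5):]])     # keep ~20% per cluster

svm = SVC(kernel="rbf", gamma="scale").fit(X_tr[keep_idx], y_tr[keep_idx])
print(len(keep_idx), "of", len(X_tr), "training samples kept; test accuracy:",
      svm.score(X_te, y_te))
```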
APA, Harvard, Vancouver, ISO, and other styles
9

Nakoneczny, S. J., M. Bilicki, A. Pollo, M. Asgari, A. Dvornik, T. Erben, B. Giblin, et al. "Photometric selection and redshifts for quasars in the Kilo-Degree Survey Data Release 4." Astronomy & Astrophysics 649 (May 2021): A81. http://dx.doi.org/10.1051/0004-6361/202039684.

Full text
Abstract:
We present a catalog of quasars with their corresponding redshifts derived from the photometric Kilo-Degree Survey (KiDS) Data Release 4. We achieved it by training machine learning (ML) models, using optical ugri and near-infrared ZYJHKs bands, on objects known from Sloan Digital Sky Survey (SDSS) spectroscopy. We define inference subsets from the 45 million objects of the KiDS photometric data limited to 9-band detections, based on a feature space built from magnitudes and their combinations. We show that projections of the high-dimensional feature space on two dimensions can be successfully used, instead of the standard color-color plots, to investigate the photometric estimations, compare them with spectroscopic data, and efficiently support the process of building a catalog. The model selection and fine-tuning employs two subsets of objects: those randomly selected and the faintest ones, which allowed us to properly fit the bias versus variance trade-off. We tested three ML models: random forest (RF), XGBoost (XGB), and artificial neural network (ANN). We find that XGB is the most robust and straightforward model for classification, while ANN performs the best for combined classification and redshift. The ANN inference results are tested using number counts, Gaia parallaxes, and other quasar catalogs that are external to the training set. Based on these tests, we derived the minimum classification probability for quasar candidates which provides the best purity versus completeness trade-off: p(QSOcand) > 0.9 for r < 22 and p(QSOcand) > 0.98 for 22 < r < 23.5. We find 158 000 quasar candidates in the safe inference subset (r < 22) and an additional 185 000 candidates in the reliable extrapolation regime (22 < r < 23.5). Test-data purity equals 97% and completeness is 94%; the latter drops by 3% in the extrapolation to data fainter by one magnitude than the training set. The photometric redshifts were derived with ANN and modeled with Gaussian uncertainties. The test-data redshift error (mean and scatter) equals 0.009 ± 0.12 in the safe subset and −0.0004 ± 0.19 in the extrapolation, averaged over a redshift range of 0.14 < z < 3.63 (first and 99th percentiles). Our success of the extrapolation challenges the way that models are optimized and applied at the faint data end. The resulting catalog is ready for cosmology and active galactic nucleus (AGN) studies.
APA, Harvard, Vancouver, ISO, and other styles
10

Zavala, Valentina A., Tatiana Vidaurre, Xiaosong Huang, Sandro Casavilca, Jeannie Navarro, Michelle A. Williams, Sixto Sanchez, et al. "Abstract 3683: Identification of optimal set of genetic variants from a previously reported polygenic risk score for breast cancer risk prediction in Latin American women." Cancer Research 82, no. 12_Supplement (June 15, 2022): 3683. http://dx.doi.org/10.1158/1538-7445.am2022-3683.

Full text
Abstract:
Around 10% of genetic predisposition for breast cancer is explained by mutations in high/moderate penetrance genes. The remaining proportion is explained by multiple common variants of relatively small effect. A subset of these variants has been identified, mostly in Europeans and Asians, and combined into polygenic risk scores (PRS) to predict breast cancer risk. Our aim is to identify a subset of variants to improve breast cancer risk prediction in Hispanics/Latinas (H/Ls). Breast cancer patients were recruited at the Instituto Nacional de Enfermedades Neoplásicas in Peru, to be part of The Peruvian Genetics and Genomics of Breast Cancer Study (PEGEN). Women without a diagnosis of breast cancer from a pregnancy outcomes study conducted in Peru were included as controls. After quality control filters, genome-wide genotypes were available for 1,809 cases and 3,334 controls. Missing genotypes were imputed using the Michigan Imputation Server using individuals from the 1000 Genomes Project as reference. Genotypes for 313 previously reported breast cancer associated variants and 2 Latin American specific single nucleotide polymorphisms (SNPs) were extracted from the data, using an imputation r2 filter of 30%. Feature selection techniques were used to identify the best subset of SNPs for breast cancer prediction in Peruvian women. We randomly split the PEGEN data by 4:1 ratio for training/validation and testing. Training/validation data were resampled and split in 3:1 ratio into training and validation sets. SNP ranking and selection were done by bootstrapping results from 100 resampled training and validation sets. PRS were built by adding counts of risk alleles weighted by previously reported beta coefficients. The Area Under the Curve (AUC) was used to estimate the prediction accuracy of subsets of SNPs selected with different techniques. Logistic regression was used to test the association between standardized PRS residuals (after adjustment for genetic ancestry) and breast cancer risk. Of the 315 reported variants, 274 were available from the imputed dataset. The full 274-SNP PRS was associated with an AUC of 0.63 (95%CI=0.59-0.66) in the PEGEN study. Using different feature selection methods, we found subsets of SNPs that were associated with AUC values between 0.65-0.69. The best method (AUC=0.69, 95%CI=0.66-0.72) included a subset of 98 SNPs. Sixty-eight SNPs were selected by all methods, including the protective SNP rs140068132 in the 6q25 region, which is associated with Indigenous American ancestry and the largest contribution to the AUC. We identified a subset of 98 SNPs from a previously identified breast cancer PRS that improves breast cancer risk prediction compared to the full set, in women of high Indigenous American ancestry from Peru. Replication in women from Mexico and Colombia, and H/Ls from the U.S. will allow us to confirm these results. Citation Format: Valentina A. Zavala, Tatiana Vidaurre, Xiaosong Huang, Sandro Casavilca, Jeannie Navarro, Michelle A. Williams, Sixto Sanchez, Elad Ziv, Luis Carvajal-Carmona, Susan L. Neuhausen, Bizu Gelaye, Laura Fejerman. Identification of optimal set of genetic variants from a previously reported polygenic risk score for breast cancer risk prediction in Latin American women [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 3683.
APA, Harvard, Vancouver, ISO, and other styles
11

Swartz, James A., Qiao Lin, and Yerim Kim. "A measurement invariance analysis of selected Opioid Overdose Knowledge Scale (OOKS) items among bystanders and first responders." PLOS ONE 17, no. 10 (October 14, 2022): e0271418. http://dx.doi.org/10.1371/journal.pone.0271418.

Full text
Abstract:
The Opioid Overdose Knowledge Scale (OOKS) is widely used as an adjunct to opioid education and naloxone distribution (OEND) for assessing pre- and post-training knowledge. However, the extent to which the OOKS performs comparably for bystander and first responder groups has not been well determined. We used exploratory structural equation modeling (ESEM) to assess the measurement invariance of an OOKS item subset when used as an OEND training pre-test. We used secondary analysis of pre-test data collected from 446 first responders and 1,349 bystanders (N = 1,795) attending OEND trainings conducted by two county public health departments. Twenty-four items were selected by practitioner/trainer consensus from the original 45-item OOKS instrument with an additional 2 removed owing to low response variation. We used exploratory factor analysis (EFA) followed by ESEM to identify a factor structure, which we assessed for configural, metric, and scalar measurement invariance by participant group using the 22 dichotomous items (correct/incorrect) as factor indicators. EFA identified a 3-factor model consisting of items assessing: basic overdose risk information, signs of an overdose, and rescue procedures/advanced overdose risk information. Model fit by ESEM estimation versus confirmatory factor analysis showed the ESEM model afforded a better fit. Measurement invariance analyses indicated the 3-factor model fit the data across all levels of invariance per standard fit statistic metrics. The reduced set of 22 OOKS items appears to offer comparable measurement of pre-training knowledge on opioid overdose risks, signs of an overdose, and rescue procedures for both bystanders and first responders.
APA, Harvard, Vancouver, ISO, and other styles
12

Chen, Yen-Liang, Li-Chen Cheng, and Yi-Jun Zhang. "Building a training dataset for classification under a cost limitation." Electronic Library 39, no. 1 (February 24, 2021): 77–96. http://dx.doi.org/10.1108/el-07-2020-0209.

Full text
Abstract:
Purpose: A necessary preprocessing step in document classification is to label some documents so that a classifier can be built, based on which the remaining documents can be classified. Because each document differs in length and complexity, the cost of labeling each document is different. The purpose of this paper is to consider how to select a subset of documents for labeling with a limited budget, so that the total spending does not exceed the budget limit while at the same time building a classifier with the best classification results. Design/methodology/approach: In this paper, a framework is proposed to select the instances for labeling that integrates two clustering algorithms and two centroid selection methods. From the selected and labeled instances, five different classifiers were constructed with good classification accuracy to prove the superiority of the selected instances. Findings: Experimental results show that this method can establish a training data set containing the most suitable data under the premise of considering the cost constraints. The data set considers both “data representativeness” and “data selection cost,” so that the training data labeled by experts can effectively establish a classifier with high accuracy. Originality/value: No previous research has considered how to establish a training set with a cost limit when each document has a distinct labeling cost. This paper is the first attempt to resolve this issue.
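
A minimal sketch of budget-constrained instance selection in this spirit: the unlabeled documents are clustered, and cluster-central documents are picked in round-robin order as long as their labeling cost fits the remaining budget. The cost model, cluster count, and round-robin order are illustrative assumptions, not the paper's exact framework.

```python
# Sketch: pick representative (central) documents from each cluster while
# respecting a total labeling-cost budget.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 50))                     # document feature vectors (toy)
cost = rng.uniform(1.0, 5.0, size=len(docs))           # per-document labeling cost (toy)
budget = 200.0

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(docs)
queues = {}
for c in range(20):
    members = np.where(km.labels_ == c)[0]
    dist = np.linalg.norm(docs[members] - km.cluster_centers_[c], axis=1)
    queues[c] = list(members[np.argsort(dist)])        # most central documents first

selected, spent = [], 0.0
while any(queues.values()):                            # round-robin over clusters
    for c in range(20):
        if queues[c]:
            i = queues[c].pop(0)
            if spent + cost[i] <= budget:              # label it only if it fits the budget
                selected.append(i)
                spent += cost[i]

print(len(selected), "documents selected for labeling; budget used:", round(spent, 1))
```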
APA, Harvard, Vancouver, ISO, and other styles
13

Jia, Jinyuan, Yupei Liu, Xiaoyu Cao, and Neil Zhenqiang Gong. "Certified Robustness of Nearest Neighbors against Data Poisoning and Backdoor Attacks." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 9575–83. http://dx.doi.org/10.1609/aaai.v36i9.21191.

Full text
Abstract:
Data poisoning attacks and backdoor attacks aim to corrupt a machine learning classifier via modifying, adding, and/or removing some carefully selected training examples, such that the corrupted classifier makes incorrect predictions as the attacker desires. The key idea of state-of-the-art certified defenses against data poisoning attacks and backdoor attacks is to create a majority vote mechanism to predict the label of a testing example. Moreover, each voter is a base classifier trained on a subset of the training dataset. Classical simple learning algorithms such as k nearest neighbors (kNN) and radius nearest neighbors (rNN) have intrinsic majority vote mechanisms. In this work, we show that the intrinsic majority vote mechanisms in kNN and rNN already provide certified robustness guarantees against data poisoning attacks and backdoor attacks. Moreover, our evaluation results on MNIST and CIFAR10 show that the intrinsic certified robustness guarantees of kNN and rNN outperform those provided by state-of-the-art certified defenses. Our results serve as standard baselines for future certified defenses against data poisoning attacks and backdoor attacks.
APA, Harvard, Vancouver, ISO, and other styles
14

Ahmad, Wasim, Sheraz Ali Khan, Cheol Hong Kim, and Jong-Myon Kim. "Feature Selection for Improving Failure Detection in Hard Disk Drives Using a Genetic Algorithm and Significance Scores." Applied Sciences 10, no. 9 (May 4, 2020): 3200. http://dx.doi.org/10.3390/app10093200.

Full text
Abstract:
Hard disk drives (HDD) are used for data storage in personal computing platforms as well as commercial datacenters. An abrupt failure of these devices may result in an irreversible loss of critical data. Most HDD use self-monitoring, analysis, and reporting technology (SMART), and record different performance parameters to assess their own health. However, not all SMART attributes are effective at detecting a failing HDD. In this paper, a two-tier approach is presented to select the most effective precursors for a failing HDD. In the first tier, a genetic algorithm (GA) is used to select a subset of SMART attributes that lead to easily distinguishable and well clustered feature vectors in the selected subset. The GA finds the optimal feature subset by evaluating only combinations of SMART attributes, while ignoring their individual fitness. A second tier is proposed to filter the features selected using the GA by evaluating each feature independently, using a significance score that measures the statistical contribution of a feature towards disk failures. The resultant subset of selected SMART attributes is used to train a generative classifier, the naïve Bayes classifier. The proposed method is tested on a SMART dataset from a commercial datacenter, and the results are compared with state-of-the-art methods, indicating that the proposed method has a better failure detection rate and a reasonable false alarm rate. It uses fewer SMART attributes, which reduces the required training time for the classifier and does not require tuning any parameters or thresholds.
APA, Harvard, Vancouver, ISO, and other styles
15

Cardellicchio, Angelo, Sergio Ruggieri, Valeria Leggieri, and Giuseppina Uva. "View VULMA: Data Set for Training a Machine-Learning Tool for a Fast Vulnerability Analysis of Existing Buildings." Data 7, no. 1 (December 31, 2021): 4. http://dx.doi.org/10.3390/data7010004.

Full text
Abstract:
The paper presents View VULMA, a data set specifically designed for training machine-learning tools for fast vulnerability analysis of existing buildings. Such tools require supervised training via an extensive set of building imagery, for which several typological parameters should be defined, with a proper label assigned to each sample on a per-parameter basis. Thus, it is clear that defining an adequate training data set plays a key role, and several aspects should be considered, such as data availability, preprocessing, augmentation and balancing according to the selected labels. In this paper, we highlight all these issues, describing the strategies pursued to build a reliable data set. In particular, a detailed description of both the requirements (e.g., scale and resolution of images, evaluation parameters and data heterogeneity) and the steps followed to define View VULMA are provided, starting from the data assessment (which allowed us to reduce the initial sample of about 20,000 images to a subset of about 3,000 pictures), with the goal of training a transfer-learning-based automated tool for fast estimation of the vulnerability of existing buildings from single pictures.
APA, Harvard, Vancouver, ISO, and other styles
16

Ren, Jiadong, Jiawei Guo, Wang Qian, Huang Yuan, Xiaobing Hao, and Hu Jingjing. "Building an Effective Intrusion Detection System by Using Hybrid Data Optimization Based on Machine Learning Algorithms." Security and Communication Networks 2019 (June 16, 2019): 1–11. http://dx.doi.org/10.1155/2019/7130868.

Full text
Abstract:
An intrusion detection system (IDS) can effectively identify anomalous behaviors in a network; however, it still suffers from a low detection rate and a high false alarm rate, especially for anomalies with few records. In this paper, we propose an effective IDS, called DO_IDS, that uses hybrid data optimization consisting of two parts: data sampling and feature selection. In data sampling, the Isolation Forest (iForest) is used to eliminate outliers, a genetic algorithm (GA) to optimize the sampling ratio, and the Random Forest (RF) classifier as the evaluation criterion to obtain the optimal training dataset. In feature selection, GA and RF are used again to obtain the optimal feature subset. Finally, an intrusion detection system based on RF is built using the optimal training dataset obtained by data sampling and the features selected by feature selection. The experiments were carried out on the UNSW-NB15 dataset. Compared with other algorithms, the model has obvious advantages in detecting rare anomalous behaviors.
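
A minimal sketch of the data-optimization idea: Isolation Forest removes outliers from the training data, and a random forest both screens features and serves as the final detector. Importance-based feature selection stands in here for the paper's GA search, and the contamination rate and feature count are illustrative.

```python
# Sketch: outlier removal with iForest, feature screening with RF importances,
# then a final RF-based detector trained on the optimized data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)   # imbalanced "attack" class (toy)
X_tr, y_tr, X_te, y_te = X[:4000], y[:4000], X[4000:], y[4000:]

inlier = IsolationForest(contamination=0.05, random_state=0).fit_predict(X_tr) == 1
X_opt, y_opt = X_tr[inlier], y_tr[inlier]                  # optimized training dataset

probe = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_opt, y_opt)
top = np.argsort(probe.feature_importances_)[-10:]         # keep the 10 strongest features

ids = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_opt[:, top], y_opt)
print("detection accuracy:", ids.score(X_te[:, top], y_te))
```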
APA, Harvard, Vancouver, ISO, and other styles
17

Sitienei, Miriam, Ayubu Anapapa, and Argwings Otieno. "Random Forest Regression in Maize Yield Prediction." Asian Journal of Probability and Statistics 23, no. 4 (August 9, 2023): 43–52. http://dx.doi.org/10.9734/ajpas/2023/v23i4511.

Full text
Abstract:
Artificial intelligence is the discipline of making computers behave intelligently without explicit programming. Machine learning is a subset of artificial intelligence that enables machines to learn autonomously from previous data without explicit programming. The purpose of machine learning in agriculture is to increase crop yield and quality in the agricultural sector. It is driven by the emergence of big data technologies and high-performance computation, which provide new opportunities to unravel, quantify, and comprehend data-intensive agricultural operational processes. Random Forest is an ensemble technique that reduces overfitting of the result. The algorithm is primarily used for forecasting and generates a forest with numerous trees; its accuracy generally increases as the number of trees in the forest increases. Throughout the training phase, multiple decision trees are constructed: subsets of data are generated from randomly selected training samples with replacement, each subset is used to train a decision tree, and the use of multiple trees reduces the possibility of overfitting. Maize is a staple food in Kenya, and having it in sufficient amounts assures farmers' food security and economic stability. This study predicted maize yield in the Kenyan county of Uasin Gishu using Random Forest regression. The study employed a mixed-methods research design, and the survey used well-structured questionnaires containing quantitative and qualitative variables, administered directly to representative farmers in 30 clustered wards. The questionnaire encompassed 30 maize production-related variables from 900 randomly selected maize producers in the 30 wards. The model identified important variables from the dataset and predicted maize yield. The prediction was evaluated with standard regression metrics: root mean squared error (RMSE) = 0.52199, mean squared error (MSE) = 0.27248, and mean absolute error (MAE) = 0.471722. The model predicted maize yield and indicated the contribution of each variable to the overall prediction.
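
A minimal sketch of random forest regression evaluated with the same three metrics quoted above (RMSE, MSE, MAE); synthetic data stands in for the survey variables, so the numbers will not match the paper's.

```python
# Sketch: random forest regression with RMSE/MSE/MAE evaluation and variable importances.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=900, n_features=30, n_informative=12,
                       noise=5.0, random_state=0)          # 30 production-related variables (toy)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)

mse = mean_squared_error(y_te, pred)
print("RMSE:", np.sqrt(mse), " MSE:", mse, " MAE:", mean_absolute_error(y_te, pred))
print("most influential variables:", np.argsort(rf.feature_importances_)[-5:])
```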
APA, Harvard, Vancouver, ISO, and other styles
18

Xu, Xiaofeng, Ivor W. Tsang, and Chuancai Liu. "Improving Generalization via Attribute Selection on Out-of-the-Box Data." Neural Computation 32, no. 2 (February 2020): 485–514. http://dx.doi.org/10.1162/neco_a_01256.

Full text
Abstract:
Zero-shot learning (ZSL) aims to recognize unseen objects (test classes) given some other seen objects (training classes) by sharing information of attributes between different objects. Attributes are artificially annotated for objects and treated equally in recent ZSL tasks. However, some inferior attributes with poor predictability or poor discriminability may have negative impacts on the ZSL system performance. This letter first derives a generalization error bound for ZSL tasks. Our theoretical analysis verifies that selecting the subset of key attributes can improve the generalization performance of the original ZSL model, which uses all the attributes. Unfortunately, previous attribute selection methods have been conducted based on the seen data, and their selected attributes have poor generalization capability to the unseen data, which is unavailable in the training stage of ZSL tasks. Inspired by learning from pseudo-relevance feedback, this letter introduces out-of-the-box data—pseudo-data generated by an attribute-guided generative model—to mimic the unseen data. We then present an iterative attribute selection (IAS) strategy that iteratively selects key attributes based on the out-of-the-box data. Since the distribution of the generated out-of-the-box data is similar to that of the test data, the key attributes selected by IAS can be effectively generalized to test data. Extensive experiments demonstrate that IAS can significantly improve existing attribute-based ZSL methods and achieve state-of-the-art performance.
APA, Harvard, Vancouver, ISO, and other styles
19

Abuassba, Adnan Omer, Dezheng Zhang, and Xiong Luo. "A Heterogeneous AdaBoost Ensemble Based Extreme Learning Machines for Imbalanced Data." International Journal of Cognitive Informatics and Natural Intelligence 13, no. 3 (July 2019): 19–35. http://dx.doi.org/10.4018/ijcini.2019070102.

Full text
Abstract:
Extreme learning machine (ELM) is an effective learning algorithm for the single hidden layer feed-forward neural network (SLFN). It is diversified in the form of kernels or feature mapping functions, while achieving good learning performance. Its variants, including kernel ELM and regularized ELM, are agile in learning and often perform well. Dealing with imbalanced data has been a long-term focus for learning algorithms aiming to achieve satisfactory analytical results. The unbalanced class distribution imposes very challenging obstacles to implementing learning tasks in real-world applications, including online visual tracking and image quality assessment. This article addresses this issue through an advanced diverse AdaBoost-based ELM ensemble (AELME) for imbalanced binary and multiclass data classification, with the aim of improving classification accuracy on imbalanced data. In the proposed method, the ensemble is developed by splitting the training data into corresponding subsets, and different enhanced ELM algorithms, including regularized ELM and kernel ELM, are used as base learners, so that an active learner is constructed from a group of relatively weak base learners. Furthermore, AELME is implemented by training a randomly selected ELM classifier on a subset chosen by random re-sampling. Then, the labels of unseen data are predicted using the weighting approach. AELME is validated through classification on real-world benchmark datasets.
APA, Harvard, Vancouver, ISO, and other styles
20

Zahedian, Sara, Przemysław Sekuła, Amir Nohekhan, and Zachary Vander Laan. "Estimating Hourly Traffic Volumes using Artificial Neural Network with Additional Inputs from Automatic Traffic Recorders." Transportation Research Record: Journal of the Transportation Research Board 2674, no. 3 (March 2020): 272–82. http://dx.doi.org/10.1177/0361198120910737.

Full text
Abstract:
Traffic volumes are an essential input to many highway planning and design models; however, collecting this data for all road network segments is neither practical nor cost-effective. Accordingly, transportation agencies must find ways to leverage limited ground truth volume data to obtain reasonable estimates at scale on the statewide network. This paper aims to investigate the impact of selecting a subset of available automatic traffic recorders (ATRs) (i.e., the ground truth volume data source) and incorporating their data as explanatory variables into a previously developed machine learning regression model for estimating hourly traffic volumes. The study introduces a handful of strategies for selecting this subset of ATRs and walks through the process of choosing them and training models using their data as additional inputs using the New Hampshire road network as a case study. The results reveal that the overall performance of the artificial neural network (ANN) machine learning model improves with the additional inputs of selected ATRs. However, this improvement is more significant if the ATRs are selected based on their spatial distribution over the traffic message channel (TMC) network. For instance, selecting eight ATR stations according to the TMC coverage-based strategy and training the ANN with their inputs leads to average relative reductions of 35.39% and 13.44% in the mean absolute percentage error (MAPE) and error to maximum flow ratio (EMFR), respectively. The results achieved by this study can be further expanded to create a practical strategy for optimizing the number and location of ATRs through transportation networks in a state.
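
The core comparison, the same network trained with and without the extra ATR inputs and scored by MAPE, can be sketched as follows; the synthetic features, network size, and number of selected ATR stations are illustrative assumptions.

```python
# Sketch: compare MAPE of a neural-network volume estimator with and without
# additional inputs from selected ATR stations.
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 4000
base_feats = rng.normal(size=(n, 8))        # e.g. speed, time of day, road class (toy)
atr_feats = rng.normal(size=(n, 8))         # hourly volumes at 8 selected ATR stations (toy)
volume = 1000 + 80 * base_feats[:, 0] + 120 * atr_feats[:, 0] + rng.normal(scale=20, size=n)

def mape_for(features):
    X_tr, X_te, y_tr, y_te = train_test_split(features, volume, test_size=0.25, random_state=0)
    net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
    return mean_absolute_percentage_error(y_te, net.fit(X_tr, y_tr).predict(X_te))

print("MAPE without ATR inputs:", mape_for(base_feats))
print("MAPE with ATR inputs   :", mape_for(np.hstack([base_feats, atr_feats])))
```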
APA, Harvard, Vancouver, ISO, and other styles
21

Dong, Naghedolfeizi, Aberra, and Zeng. "Spectral–Spatial Discriminant Feature Learning for Hyperspectral Image Classification." Remote Sensing 11, no. 13 (June 29, 2019): 1552. http://dx.doi.org/10.3390/rs11131552.

Full text
Abstract:
Sparse representation classification (SRC) is being widely applied to target detection in hyperspectral images (HSI). However, due to the problem in HSI that high-dimensional data contain redundant information, SRC methods may fail to achieve high classification performance, even with a large number of spectral bands. Selecting a subset of predictive features in a high-dimensional space is an important and challenging problem for hyperspectral image classification. In this paper, we propose a novel discriminant feature learning (DFL) method, which combines spectral and spatial information into a hypergraph Laplacian. First, a subset of discriminative features is selected, which preserve the spectral structure of data and the inter- and intra-class constraints on labeled training samples. A feature evaluator is obtained by semi-supervised learning with the hypergraph Laplacian. Secondly, the selected features are mapped into a further lower-dimensional eigenspace through a generalized eigendecomposition of the Laplacian matrix. The finally extracted discriminative features are used in a joint sparsity-model algorithm. Experiments conducted with benchmark data sets and different experimental settings show that our proposed method increases classification accuracy and outperforms the state-of-the-art HSI classification methods.
APA, Harvard, Vancouver, ISO, and other styles
22

Gonzalez-Sanchez, Alberto, Juan Frausto-Solis, and Waldo Ojeda-Bustamante. "Attribute Selection Impact on Linear and Nonlinear Regression Models for Crop Yield Prediction." Scientific World Journal 2014 (2014): 1–10. http://dx.doi.org/10.1155/2014/509429.

Full text
Abstract:
Efficient cropping requires yield estimation for each involved crop, where data-driven models are commonly applied. In recent years, some data-driven modeling technique comparisons have been made, looking for the best model for yield prediction. However, attributes are usually selected based on expert assessment or on dimensionality reduction algorithms. A fairer comparison should include the best subset of features for each regression technique; an evaluation including several crops is preferred. This paper evaluates the most common data-driven modeling techniques applied to yield prediction, using a complete method to define the best attribute subset for each model. Multiple linear regression, stepwise linear regression, M5′ regression trees, and artificial neural networks (ANN) were ranked. The models were built using real data from eight crops sown in an irrigation module in Mexico. To validate the models, three accuracy metrics were used: the root relative square error (RRSE), relative mean absolute error (RMAE), and correlation factor (R). The results show that ANNs are more consistent in the best attribute subset composition between the learning and the training stages, obtaining the lowest average RRSE (86.04%), lowest average RMAE (8.75%), and the highest average correlation factor (0.63).
APA, Harvard, Vancouver, ISO, and other styles
23

Najafi-Ghiri, Mahdi, Marzieh Mokarram, and Hamid Reza Owliaie. "Prediction of soil clay minerals from some soil properties with use of feature selection algorithm and ANFIS methods." Soil Research 57, no. 7 (2019): 788. http://dx.doi.org/10.1071/sr18352.

Full text
Abstract:
Researchers use different methods to investigate and quantify clay minerals. X-ray diffraction is a common and widespread approach for clay mineralogy investigation, but is time-consuming and expensive, especially in highly calcareous soils. The aim of this research was prediction of clay minerals in calcareous soils of southern Iran using a feature selection algorithm and adaptive neuro-fuzzy inference system (ANFIS) methods. Fifty soil samples from different climatic regions of southern Iran were collected and different climatic, soil properties and clay minerals were determined using X-ray diffraction. Feature selection algorithms were used for selection of the best feature subset for prediction of clay mineral types along with two sets of training and testing data. Results indicated that the best feature subset by Best-First for prediction of illite was cation exchange capacity (CEC), sand, total potassium, silt and agroclimatic index (correlation coefficient (R) = 0.99 for training and testing data); for smectite was precipitation, temperature, evapotranspiration and CEC (R = 0.89 and 0.87 for training and testing data respectively); and for palygorskite was precipitation, temperature, evapotranspiration and calcium carbonate equivalent (CCE) (R = 0.98 for training and testing data). An attempt was made to predict clay minerals type by ANFIS using selected data from the feature selection algorithm. The evaluation of method by calculating root mean square error (RMSE), mean absolute error (MAE) and R indicated that the ANFIS method may be suitable for illite, chlorite, smectite and palygorskite prediction (RMSE, MAE and R of 0.001–0.028, 0.004–0.012 and 0.67–0.89 respectively for training and testing data). Comparison of data for all clay minerals showed that ANFIS method did not predict illite and chlorite as well as other minerals in the studied soils.
APA, Harvard, Vancouver, ISO, and other styles
24

Woodrow, Sarah I., Mark Bernstein, and M. Christopher Wallace. "Safety of intracranial aneurysm surgery performed in a postgraduate training program: implications for training." Journal of Neurosurgery 102, no. 4 (April 2005): 616–21. http://dx.doi.org/10.3171/jns.2005.102.4.0616.

Full text
Abstract:
Object. Patient care and educational experience have long formed a dichotomy in modern surgical training. In neurosurgery, achieving a delicate balance between these two factors has been challenged by recent trends in the field including increased subspecialization, emerging technologies, and decreased resident work hours. In this study the authors evaluated the experience profiles of neurosurgical trainees at a large Canadian academic center and the safety of their practice on patient care. Methods. Two hundred ninety-three patients who underwent surgery for intracranial aneurysm clipping between 1993 and 1996 were selected. Prospective data were available in 167 cases, allowing the operating surgeon to be identified. Postoperative data and follow-up data were gathered retrospectively to measure patient outcomes. In 167 cases, a total of 183 aneurysms were clipped, the majority (91%) by neurosurgical trainees. Trainees performed dissections on aneurysms that were predominantly small (<1.5 cm in diameter; 77% of patients) and ruptured (64% of patients). Overall mortality rates for the patients treated by the trainee group were 4% (two of 52 patients) and 9% (nine of 100 patients) for unruptured and ruptured aneurysm cases, respectively. Patient outcomes were comparable to those reported in historical data. Staff members appeared to be primary surgeons in a select subset of cases. Conclusions. Neurosurgical trainees at this institution are exposed to a broad spectrum of intracranial aneurysms, although some case selection does occur. With careful supervision, intracranial aneurysm surgery can be safely delegated to trainees without compromising patient outcomes. Current trends in practice patterns in neurosurgery mandate ongoing monitoring of residents' operative experience while ensuring continued excellence in patient care.
APA, Harvard, Vancouver, ISO, and other styles
25

Szyda, J., K. Żukowski, S. Kamiński, and A. Żarnecki. "Testing different single nucleotide polymorphism selection strategies for prediction of genomic breeding values in dairy cattle based on low density panels." Czech Journal of Animal Science 58, No. 3 (March 4, 2013): 136–45. http://dx.doi.org/10.17221/6670-cjas.

Full text
Abstract:
In human and animal genetics dense single nucleotide polymorphism (SNP) panels are widely used to describe genetic variation. In particular genomic selection in dairy cattle has become a routinely applied tool for prediction of additive genetic values of animals, especially of young selection candidates. The aim of the study was to investigate how well an additive genetic value can be predicted using various sets of approximately 3000 SNPs selected out of the 54 001 SNPs in an Illumina BovineSNP50 BeadChip high density panel. Effects of SNPs from the nine subsets of the 54 001 panel were estimated using a model with a random uncorrelated SNPs effect based on a training data set of 1216 Polish Holstein-Friesian bulls whose phenotypic records were approximated by deregressed estimated breeding values for milk, protein, and fat yields. Predictive ability of the low density panels was assessed using a validation data set of 622 bulls. Correlations between direct and conventional breeding values routinely estimated for the Polish population were similar across traits and clearly across sets of SNPs. For the training data set correlations varied between 0.94 and 0.98, for the validation data set between 0.25 and 0.46. The corresponding correlations estimated using the 54 001 panel were: 0.98 for the three traits (training), 0.98 (milk and fat yields, validation), and 0.97 (protein yield, validation). The optimal subset consisted of SNPs selected based on their highest effects for milk yield obtained from the evaluation of all 54 001 SNPs. A low density SNP panel allows for reasonably good prediction of future breeding values. Even though correlations between direct and conventional breeding values were moderate, for young selection candidates a low density panel is a better predictor than a commonly used average of parental breeding values.
APA, Harvard, Vancouver, ISO, and other styles
26

He, Ruimin, Xiaohua Yang, Tengxiang Li, Yaolin He, Xiaoxue Xie, Qilei Chen, Zijian Zhang, and Tingting Cheng. "A Machine Learning-Based Predictive Model of Epidermal Growth Factor Mutations in Lung Adenocarcinomas." Cancers 14, no. 19 (September 25, 2022): 4664. http://dx.doi.org/10.3390/cancers14194664.

Full text
Abstract:
Data from 758 patients with lung adenocarcinoma were retrospectively collected. All patients had undergone computed tomography imaging and EGFR gene testing. Radiomic features were extracted using the medical imaging tool 3D-Slicer and were combined with the clinical features to build a machine learning prediction model. The high-dimensional feature set was screened for optimal feature subsets using principal component analysis (PCA) and the least absolute shrinkage and selection operator (LASSO). Model prediction of EGFR mutation status in the validation group was evaluated using multiple classifiers. We showed that six clinical features and 622 radiomic features were initially collected. Thirty-one radiomic features with non-zero correlation coefficients were obtained by LASSO regression, and 24 features correlated with label values were obtained by PCA. The shared radiomic features determined by these two methods were selected and combined with the clinical features of the respective patient to form a subset of features related to EGFR mutations. The full dataset was partitioned into training and test sets at a ratio of 7:3 using 10-fold cross-validation. The area under the curve (AUC) of the four classifiers with cross-validations was: (1) K-nearest neighbor (AUCmean = 0.83, Acc = 81%); (2) random forest (AUCmean = 0.91, Acc = 83%); (3) LGBM (AUCmean = 0.94, Acc = 88%); and (4) support vector machine (AUCmean = 0.79, Acc = 83%). In summary, the subset of radiographic and clinical features selected by feature engineering effectively predicted the EGFR mutation status of this NSCLC patient cohort.
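
A minimal sketch of the feature-engineering step: the features with non-zero coefficients under an L1 penalty are intersected with those kept by a second screening step, and several classifiers are compared by cross-validated AUC. L1-penalized logistic regression stands in for LASSO, a univariate filter stands in for the PCA-based step, and all sizes are assumptions.

```python
# Sketch: intersect two feature-selection results, then compare classifiers by 10-fold AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=758, n_features=200, n_informative=15,
                           random_state=0)                 # radiomic + clinical features (toy)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
l1_keep = set(np.flatnonzero(l1.coef_[0]))                 # non-zero L1 coefficients
screen_keep = set(SelectKBest(f_classif, k=40).fit(X, y).get_support(indices=True))
shared = sorted(l1_keep & screen_keep) or sorted(l1_keep)  # features kept by both steps

for name, clf in [("kNN", KNeighborsClassifier()),
                  ("random forest", RandomForestClassifier(random_state=0)),
                  ("SVM", SVC(probability=True, random_state=0))]:
    auc = cross_val_score(clf, X[:, shared], y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f} over {len(shared)} shared features")
```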
APA, Harvard, Vancouver, ISO, and other styles
27

Chau, K. W., and C. L. Wu. "A hybrid model coupled with singular spectrum analysis for daily rainfall prediction." Journal of Hydroinformatics 12, no. 4 (April 2, 2010): 458–73. http://dx.doi.org/10.2166/hydro.2010.032.

Full text
Abstract:
A hybrid model integrating artificial neural networks and support vector regression was developed for daily rainfall prediction. In the modeling process, singular spectrum analysis was first adopted to decompose the raw rainfall data. Fuzzy C-means clustering was then used to split the training set into three crisp subsets which may be associated with low-, medium- and high-intensity rainfall. Two local artificial neural network models were involved in training and predicting low- and medium-intensity subsets whereas a local support vector regression model was applied to the high-intensity subset. A conventional artificial neural network model was selected as the benchmark. The artificial neural network with the singular spectrum analysis was developed for the purpose of examining the singular spectrum analysis technique. The models were applied to two daily rainfall series from China at 1-day-, 2-day- and 3-day-ahead forecasting horizons. Results showed that the hybrid support vector regression model performed the best. The singular spectrum analysis model also exhibited considerable accuracy in rainfall forecasting. Also, two methods to filter reconstructed components of singular spectrum analysis, supervised and unsupervised approaches, were compared. The unsupervised method appeared more effective where nonlinear dependence between model inputs and output can be considered.
APA, Harvard, Vancouver, ISO, and other styles
28

El-Gawady, Aliaa, Mohamed A. Makhlouf, BenBella S. Tawfik, and Hamed Nassar. "Machine Learning Framework for the Prediction of Alzheimer’s Disease Using Gene Expression Data Based on Efficient Gene Selection." Symmetry 14, no. 3 (February 28, 2022): 491. http://dx.doi.org/10.3390/sym14030491.

Full text
Abstract:
In recent years, much research has focused on using machine learning (ML) for disease prediction based on gene expression (GE) data. However, many diseases have received considerable attention, whereas some, including Alzheimer’s disease (AD), have not, perhaps due to data shortage. The present work is intended to fill this gap by introducing a symmetric framework to predict AD from GE data, with the aim to produce the most accurate prediction using the smallest number of genes. The framework works in four stages after it receives a training dataset: pre-processing, gene selection (GS), classification, and AD prediction. The symmetry of the model is manifested in all of its stages. In the pre-processing stage gene columns in the training dataset are pre-processed identically. In the GS stage, the same user-defined filter metrics are invoked on every gene individually, and so are the same user-defined wrapper metrics. In the classification stage, a number of user-defined ML models are applied identically using the minimal set of genes selected in the preceding stage. The core of the proposed framework is a meticulous GS algorithm which we have designed to nominate eight subsets of the original set of genes provided in the training dataset. Exploring the eight subsets, the algorithm selects the best one to describe AD, and also the best ML model to predict the disease using this subset. For credible results, the framework calculates performance metrics using repeated stratified k-fold cross validation. To evaluate the framework, we used an AD dataset of 1157 cases and 39,280 genes, obtained by combining a number of smaller public datasets. The cases were split in two partitions, 1000 for training/testing, using 10-fold CV repeated 30 times, and 157 for validation. From the testing/training phase, the framework identified only 1058 genes to be the most relevant and the support vector machine (SVM) model to be the most accurate with these genes. In the final validation, we used the 157 cases that were never seen by the SVM classifier. For credible performance evaluation, we evaluated the classifier via six metrics, for which we obtained impressive values. Specifically, we obtained 0.97, 0.97, 0.98, 0.945, 0.972, and 0.975 for the sensitivity (recall), specificity, precision, kappa index, AUC, and accuracy, respectively.
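
The final stages of such a pipeline, gene selection followed by an SVM evaluated with repeated stratified k-fold cross-validation, can be sketched as follows; the ANOVA-F filter, the gene counts, and the use of three repeats (rather than thirty) are illustrative assumptions.

```python
# Sketch: univariate gene selection + SVM, scored with repeated stratified k-fold CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=2000, n_informative=30,
                           random_state=0)                 # toy gene-expression matrix

model = make_pipeline(StandardScaler(),
                      SelectKBest(f_classif, k=100),       # gene-selection stage
                      SVC(kernel="rbf", gamma="scale"))    # classification stage
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```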
APA, Harvard, Vancouver, ISO, and other styles
29

Mazloom, Reza, Hongmin Li, Doina Caragea, Cornelia Caragea, and Muhammad Imran. "A Hybrid Domain Adaptation Approach for Identifying Crisis-Relevant Tweets." International Journal of Information Systems for Crisis Response and Management 11, no. 2 (July 2019): 1–19. http://dx.doi.org/10.4018/ijiscram.2019070101.

Full text
Abstract:
Huge amounts of data generated on social media during emergency situations are regarded as a trove of critical information. The use of supervised machine learning techniques in the early stages of a crisis is challenged by the lack of labeled data for that event. Furthermore, supervised models trained on labeled data from a prior crisis may not produce accurate results, due to inherent crisis variations. To address these challenges, the authors propose a hybrid feature-instance-parameter adaptation approach based on matrix factorization, k-nearest neighbors, and self-training. The proposed feature-instance adaptation selects a subset of the source crisis data that is representative of the target crisis data. The selected labeled source data, together with unlabeled target data, are used to learn self-training domain adaptation classifiers for the target crisis. Experimental results have shown that, overall, the hybrid domain adaptation classifiers perform better than supervised classifiers learned from the original source data.
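A minimal sketch of the feature-instance adaptation idea follows; it is an illustration under assumed data shapes, not the published method: source tweets close to the target distribution are retained via nearest-neighbour distances, and a self-training classifier is then learned on labelled source plus unlabelled target examples.

```python
# Illustrative instance selection + self-training (synthetic features, hypothetical sizes).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X_src, y_src = rng.random((500, 50)), rng.integers(0, 2, 500)   # labelled prior crisis
X_tgt = rng.random((400, 50))                                    # unlabelled target crisis

# Keep the half of the source data that lies closest to the target distribution
dist, _ = NearestNeighbors(n_neighbors=5).fit(X_tgt).kneighbors(X_src)
keep = dist.mean(axis=1) < np.quantile(dist.mean(axis=1), 0.5)

X_train = np.vstack([X_src[keep], X_tgt])
y_train = np.concatenate([y_src[keep], np.full(len(X_tgt), -1)])  # -1 marks unlabelled
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
```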
APA, Harvard, Vancouver, ISO, and other styles
30

Sharpe, P. K., H. E. Solberg, K. Rootwelt, and M. Yearworth. "Artificial neural networks in diagnosis of thyroid function from in vitro laboratory tests." Clinical Chemistry 39, no. 11 (November 1, 1993): 2248–53. http://dx.doi.org/10.1093/clinchem/39.11.2248.

Full text
Abstract:
We studied the potential benefit of using artificial neural networks (ANNs) for the diagnosis of thyroid function. We examined two types of ANN architecture and assessed their robustness in the face of diagnostic noise. The thyroid function data we used had previously been studied by multivariate statistical methods and a variety of pattern-recognition techniques. The total data set comprised 392 cases that had been classified according to both thyroid function and 19 clinical categories. All cases had a complete set of results of six laboratory tests (total thyroxine, free thyroxine, triiodothyronine, triiodothyronine uptake test, thyrotropin, and thyroxine-binding globulin). This data set was divided into subsets used for training the networks and for testing their performance; the test subsets contained various proportions of cases with diagnostic noise to mimic real-life diagnostic situations. The networks studied were a multilayer perceptron trained by back-propagation, and a learning vector quantization network. The training data subsets were selected according to two strategies: either training data based on cases with extreme values for the laboratory tests with randomly selected nonextreme cases added, or training cases from very pure functional groups. Both network architectures were efficient irrespective of the type of training data. The correct allocation of cases in test data subsets was 96.4-99.7% when extreme values were used for training and 92.7-98.8% when only pure cases were used.
APA, Harvard, Vancouver, ISO, and other styles
31

Hensel, Stefan, Marin B. Marinov, Michael Koch, and Dimitar Arnaudov. "Evaluation of Deep Learning-Based Neural Network Methods for Cloud Detection and Segmentation." Energies 14, no. 19 (September 27, 2021): 6156. http://dx.doi.org/10.3390/en14196156.

Full text
Abstract:
This paper presents a systematic approach for accurate short-time cloud coverage prediction based on a machine learning (ML) approach. Based on a newly built omnidirectional ground-based sky camera system, local training and evaluation data sets were created. These were used to train several state-of-the-art deep neural networks for object detection and segmentation. For this purpose, the camera generated a full hemispherical image every 30 min over two months in daylight conditions with a fish-eye lens. From this data set, a subset of images was selected for training and evaluation according to various criteria. Deep neural networks, based on the two-stage R-CNN architecture, were trained and compared with a U-net segmentation approach implemented by CloudSegNet. All chosen deep networks were then evaluated and compared according to the local situation.
APA, Harvard, Vancouver, ISO, and other styles
32

Ramesh, Nisha, Ting Liu, and Tolga Tasdizen. "Cell Detection Using Extremal Regions in a Semisupervised Learning Framework." Journal of Healthcare Engineering 2017 (2017): 1–13. http://dx.doi.org/10.1155/2017/4080874.

Full text
Abstract:
This paper discusses an algorithm to build a semisupervised learning framework for detecting cells. The cell candidates are represented as extremal regions drawn from a hierarchical image representation. Training a classifier for cell detection using supervised approaches relies on a large amount of training data, which requires a lot of effort and time. We propose a semisupervised approach to reduce this burden. The set of extremal regions is generated using a maximally stable extremal region (MSER) detector. A subset of nonoverlapping regions with high similarity to the cells of interest is selected. Using the tree built from the MSER detector, we develop a novel differentiable unsupervised loss term that enforces the nonoverlapping constraint with the learned function. Our algorithm requires very few examples of cells with simple dot annotations for training. The supervised and unsupervised losses are embedded in a Bayesian framework for probabilistic learning.
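The candidate-generation step (extremal regions from an MSER detector, thinned to a non-overlapping subset) can be sketched as follows; the image is a random placeholder, the greedy overlap rule is a simplification, and the semi-supervised classifier and unsupervised loss from the paper are not shown.

```python
# Hedged sketch: MSER extremal regions as cell candidates, greedily made non-overlapping.
import cv2
import numpy as np

gray = (np.random.default_rng(0).random((256, 256)) * 255).astype(np.uint8)  # placeholder image
mser = cv2.MSER_create()
regions, _ = mser.detectRegions(gray)          # each region is an array of (x, y) pixels

keep, occupied = [], np.zeros(gray.shape, dtype=bool)
for region in sorted(regions, key=len, reverse=True):   # larger regions first
    xs, ys = region[:, 0], region[:, 1]
    if not occupied[ys, xs].any():             # accept only if it overlaps nothing kept so far
        keep.append(region)
        occupied[ys, xs] = True
```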
APA, Harvard, Vancouver, ISO, and other styles
33

Yi, Liu, Diao Xing-chun, Cao Jian-jun, Zhou Xing, and Shang Yu-ling. "A Method for Entity Resolution in High Dimensional Data Using Ensemble Classifiers." Mathematical Problems in Engineering 2017 (2017): 1–11. http://dx.doi.org/10.1155/2017/4953280.

Full text
Abstract:
In order to improve the utilization rate of high-dimensional data features, an ensemble learning method based on feature selection is developed for entity resolution. Entity resolution is regarded as a binary classification problem, and an optimization model is designed to maximize each classifier’s classification accuracy and the dissimilarity between classifiers while minimizing the cardinality of features. A modified multiobjective ant colony optimization algorithm is employed to solve the model for each base classifier: two pheromone matrices are set up, the weighted product method is applied to aggregate the values of the two pheromone matrices, and each feature’s Fisher discriminant rate on the records’ similarity vectors is calculated as heuristic information. A solution called the complementary subset is selected from the Pareto archive according to the descending order of the three objectives to train the given base classifier. After training all base classifiers, their classification outputs are aggregated by the max-wins voting method to obtain the ensemble classifier’s final result. A simulation experiment is carried out on three classical datasets. The results show the effectiveness of our method, as well as a better performance compared with the other two methods.
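Only the final aggregation step lends itself to a compact illustration; the sketch below trains base classifiers on different (here hand-picked) feature subsets and combines them by hard, max-wins voting, whereas the paper's ant-colony subset search is omitted.

```python
# Hedged sketch of max-wins (hard) voting over base classifiers with distinct feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

X, y = make_classification(n_samples=600, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

subsets = {"clf1": slice(0, 15), "clf2": slice(10, 25), "clf3": slice(25, 40)}
base = [(name, make_pipeline(FunctionTransformer(lambda Z, s=s: Z[:, s]),
                             LogisticRegression(max_iter=1000)))
        for name, s in subsets.items()]
ensemble = VotingClassifier(base, voting="hard").fit(X_tr, y_tr)
print("ensemble accuracy:", ensemble.score(X_te, y_te))
```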
APA, Harvard, Vancouver, ISO, and other styles
34

Maya Gopal P S and Bhargavi R. "Selection of Important Features for Optimizing Crop Yield Prediction." International Journal of Agricultural and Environmental Information Systems 10, no. 3 (July 2019): 54–71. http://dx.doi.org/10.4018/ijaeis.2019070104.

Full text
Abstract:
In agriculture, crop yield prediction is critical. Crop yield depends on various features, including geographic, climatic, and biological factors. This research article discusses five Feature Selection (FS) algorithms, namely Sequential Forward FS, Sequential Backward Elimination FS, Correlation-based FS, Random Forest Variable Importance, and the Variance Inflation Factor algorithm. Data used for the analysis were drawn from secondary sources of the Tamil Nadu state Agriculture Department for a period of 30 years. 75% of the data was used for training and 25% for testing. The performance of the feature selection algorithms is evaluated by Multiple Linear Regression, and the RMSE, MAE, R, and RRMSE metrics are calculated for each algorithm. The adjusted R2 was used to find the optimum feature subset, and the time complexity of the algorithms was also considered. The selected features are applied to Multiple Linear Regression, an Artificial Neural Network, and M5Prime. MLR gives 85% accuracy using the features selected by the SFFS algorithm.
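A hedged sketch of one of the listed approaches, sequential forward feature selection scored by a multiple linear regression model, is shown below on synthetic data; the subset size and split are illustrative rather than taken from the study.

```python
# Sequential forward selection wrapped around an MLR model (synthetic stand-in data).
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=15, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=6,
                                direction="forward", cv=5).fit(X_tr, y_tr)
mlr = LinearRegression().fit(sfs.transform(X_tr), y_tr)
print("selected features:", sfs.get_support(indices=True))
print("test R^2:", mlr.score(sfs.transform(X_te), y_te))
```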
APA, Harvard, Vancouver, ISO, and other styles
35

Liu, Ruidan, and Yu Dong. "Fault Diagnosis of Jointless Track Circuit Based on ReliefF-C4.5 Decision Tree." Journal of Physics: Conference Series 2383, no. 1 (December 1, 2022): 012047. http://dx.doi.org/10.1088/1742-6596/2383/1/012047.

Full text
Abstract:
At present, judging the fault type of a jointless track circuit relies mainly on staff analysing computer monitoring data. This fault identification method suffers from a high dependence on staff and a long fault identification cycle. To solve these problems, a combination of the ReliefF algorithm and the C4.5 decision tree is introduced for fault diagnosis of the jointless track circuit. First, the ReliefF algorithm assigns weights to the nine selected feature parameters according to each feature's ability to discriminate between fault classes. Second, the feature parameters are sorted by weight, and the top seven are selected as the optimal feature subset. Finally, the classifier is trained and tested with the optimal feature subset. The experimental results show that the fault diagnosis accuracy of the ReliefF-C4.5 decision tree is 93.33%, so the fault diagnosis model proposed in this paper is suitable for fault diagnosis of track circuits.
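A simplified, self-contained version of the two-stage idea (Relief-style feature weighting followed by an entropy-based decision tree, used here as a C4.5 stand-in) is sketched below with synthetic data; the real study uses ReliefF on nine monitored parameters and keeps the top seven.

```python
# Hedged sketch: basic Relief weighting (binary labels) + entropy-criterion decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def relief_weights(X, y, n_iter=200, seed=0):
    """Reward features that separate the nearest miss, penalise those differing from the nearest hit."""
    rng = np.random.default_rng(seed)
    X = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-12)   # scale features to [0, 1]
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))
        same, diff = y == y[i], y != y[i]
        same[i] = False
        hit = X[same][np.abs(X[same] - X[i]).sum(1).argmin()]
        miss = X[diff][np.abs(X[diff] - X[i]).sum(1).argmin()]
        w += (np.abs(X[i] - miss) - np.abs(X[i] - hit)) / n_iter
    return w

X = np.random.default_rng(1).random((300, 9))        # nine monitored feature parameters
y = (X[:, 0] + X[:, 3] > 1).astype(int)              # synthetic fault label
top7 = np.argsort(relief_weights(X, y))[::-1][:7]    # keep the seven highest-weight features
clf = DecisionTreeClassifier(criterion="entropy").fit(X[:, top7], y)
```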
APA, Harvard, Vancouver, ISO, and other styles
36

Aversa, Rossella, Piero Coronica, Cristiano De Nobili, and Stefano Cozzini. "Deep Learning, Feature Learning, and Clustering Analysis for SEM Image Classification." Data Intelligence 2, no. 4 (October 2020): 513–28. http://dx.doi.org/10.1162/dint_a_00062.

Full text
Abstract:
In this paper, we report upon our recent work aimed at improving and adapting machine learning algorithms to automatically classify nanoscience images acquired by the Scanning Electron Microscope (SEM). This is done by coupling supervised and unsupervised learning approaches. We first investigate supervised learning on a ten-category data set of images and compare the performance of the different models in terms of training accuracy. Then, we reduce the dimensionality of the features through autoencoders to perform unsupervised learning on a subset of images in a selected range of scales (from 1 μm to 2 μm). Finally, we compare different clustering methods to uncover intrinsic structures in the images.
APA, Harvard, Vancouver, ISO, and other styles
37

A. Ramezan, Christopher, Timothy A. Warner, and Aaron E. Maxwell. "Evaluation of Sampling and Cross-Validation Tuning Strategies for Regional-Scale Machine Learning Classification." Remote Sensing 11, no. 2 (January 18, 2019): 185. http://dx.doi.org/10.3390/rs11020185.

Full text
Abstract:
High spatial resolution (1–5 m) remotely sensed datasets are increasingly being used to map land covers over large geographic areas using supervised machine learning algorithms. Although many studies have compared machine learning classification methods, sample selection methods for acquiring training and validation data for machine learning, and cross-validation techniques for tuning classifier parameters are rarely investigated, particularly on large, high spatial resolution datasets. This work, therefore, examines four sample selection methods—simple random, proportional stratified random, disproportional stratified random, and deliberative sampling—as well as three cross-validation tuning approaches—k-fold, leave-one-out, and Monte Carlo methods. In addition, the effect on accuracy of localizing sample selection to a small geographic subset of the entire area, an approach that is sometimes used to reduce the costs associated with training data collection, is investigated. These methods are investigated in the context of support vector machine (SVM) classification and geographic object-based image analysis (GEOBIA), using high spatial resolution National Agricultural Imagery Program (NAIP) orthoimagery and LIDAR-derived rasters, covering a 2,609 km² regional-scale area in northeastern West Virginia, USA. Stratified-statistical-based sampling methods were found to generate the highest classification accuracy. Using a small number of training samples collected from only a subset of the study area provided a similar level of overall accuracy to a sample of equivalent size collected in a dispersed manner across the entire regional-scale dataset. There were minimal differences in accuracy for the different cross-validation tuning methods, but the processing times for Monte Carlo and leave-one-out cross-validation were high, especially with large training sets; for this reason, k-fold cross-validation appears to be a good choice. Classifications trained with samples collected deliberately (i.e., not randomly) were less accurate than classifiers trained from statistical-based samples, which may be due to the high positive spatial autocorrelation in the deliberative training set. Thus, if possible, samples for training should be selected randomly; deliberative samples should be avoided.
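The three tuning strategies compared in the study can be contrasted on a toy problem as follows; the dataset, grid, and fold counts are placeholders, and the sample size is kept deliberately small because leave-one-out cost grows with the number of training samples, which is the behaviour the study reports.

```python
# Hedged sketch: SVM hyperparameter tuning with k-fold, leave-one-out, and Monte Carlo CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, LeaveOneOut, ShuffleSplit
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=10, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
strategies = {
    "k-fold": KFold(n_splits=10, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
    "Monte Carlo": ShuffleSplit(n_splits=50, test_size=0.2, random_state=0),
}
for name, cv in strategies.items():
    search = GridSearchCV(SVC(), param_grid, cv=cv).fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```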
APA, Harvard, Vancouver, ISO, and other styles
38

Chatterjee, Soumick, Kartik Prabhu, Mahantesh Pattadkal, Gerda Bortsova, Chompunuch Sarasaen, Florian Dubost, Hendrik Mattern, Marleen de Bruijne, Oliver Speck, and Andreas Nürnberger. "DS6: Deformation-Aware Semi-Supervised Learning: Application to Small Vessel Segmentation with Noisy Training Data." Journal of Imaging 8, no. 10 (September 22, 2022): 259. http://dx.doi.org/10.3390/jimaging8100259.

Full text
Abstract:
Blood vessels of the brain provide the human brain with the required nutrients and oxygen. As a vulnerable part of the cerebral blood supply, pathology of small vessels can cause serious problems such as Cerebral Small Vessel Diseases (CSVD). It has also been shown that CSVD is related to neurodegeneration, such as Alzheimer’s disease. With the advancement of 7 Tesla MRI systems, higher spatial image resolution can be achieved, enabling the depiction of very small vessels in the brain. Non-Deep Learning-based approaches for vessel segmentation, e.g., Frangi’s vessel enhancement with subsequent thresholding, are capable of segmenting medium to large vessels but often fail to segment small vessels. The sensitivity of these methods to small vessels can be increased by extensive parameter tuning or by manual corrections, albeit making them time-consuming, laborious, and not feasible for larger datasets. This paper proposes a deep learning architecture to automatically segment small vessels in 7 Tesla 3D Time-of-Flight (ToF) Magnetic Resonance Angiography (MRA) data. The algorithm was trained and evaluated on a small imperfect semi-automatically segmented dataset of only 11 subjects; using six for training, two for validation, and three for testing. The deep learning model based on U-Net Multi-Scale Supervision was trained using the training subset and was made equivariant to elastic deformations in a self-supervised manner using deformation-aware learning to improve the generalisation performance. The proposed technique was evaluated quantitatively and qualitatively against the test set and achieved a Dice score of 80.44 ± 0.83. Furthermore, the result of the proposed method was compared against a selected manually segmented region (62.07 resultant Dice) and has shown a considerable improvement (18.98%) with deformation-aware learning.
APA, Harvard, Vancouver, ISO, and other styles
39

Oglesby, Leslie W., Andrew R. Gallucci, and Christopher J. Wynveen. "Athletic Trainer Burnout: A Systematic Review of the Literature." Journal of Athletic Training 55, no. 4 (April 1, 2020): 416–30. http://dx.doi.org/10.4085/1062-6050-43-19.

Full text
Abstract:
Objective To identify the causes, effects, and prevalence of burnout in athletic trainers (ATs) identified in the literature. Data Sources EBSCO: SPORTDiscus and OneSearch were accessed, using the search terms athletic trainer AND burnout. Study Selection Studies selected for inclusion were peer reviewed, published in a journal, and written in English and investigated prevalence, causes, effects, or alleviation of AT burnout. Data Extraction The initial search yielded 558 articles. Articles that did not specifically involve ATs were excluded from further inspection. The remaining 83 full-text articles were reviewed. Of these 83 articles, 48 examined prevalence, causes, effects, or alleviation of AT burnout. An evaluation of the bibliographies of those 48 articles revealed 3 additional articles that were not initially identified but met the inclusion criteria. In total, 51 articles were included in data collection. Data Synthesis Articles were categorized based on investigation of prevalence, causes, effects, or alleviation of burnout. Articles were also categorized based on which subset of the athletic training population they observed (ie, athletic training students, certified graduate assistants, high school or collegiate staff members, academic faculty). Conclusions Burnout was observed in all studied subsets of the population (ie, students, graduate assistants, staff, faculty), and multiple causes of burnout were reported. Suggested causes of burnout in ATs included work-life conflict and organizational factors such as poor salaries, long hours, and difficulties dealing with the “politics and bureaucracy” of athletics. Effects of burnout in ATs included physical, emotional, and behavioral concerns (eg, intention to leave the job or profession).
APA, Harvard, Vancouver, ISO, and other styles
40

Kutyłowska, M. "Forecasting failure rate of water pipes." Water Supply 19, no. 1 (April 13, 2018): 264–73. http://dx.doi.org/10.2166/ws.2018.078.

Full text
Abstract:
This paper presents the results of failure rate prediction by means of support vector machines (SVM) – a non-parametric regression method. A hyperplane is used to divide the whole area in such a way that objects of different affiliation are separated from one another. The number of support vectors determines the complexity of the relations between dependent and independent variables. The calculations were performed using Statistica 12.0. Operational data for one selected zone of the water supply system for the period 2008–2014 were used for forecasting. The whole data set (in which data on distribution pipes were distinguished from those on house connections) for the years 2008–2014 was randomly divided into two subsets: a training subset – 75% (5 years) – and a testing subset – 25% (2 years). Dependent variables (λr for the distribution pipes and λp for the house connections) were forecast using independent variables (the total length – Lr and Lp – and number of failures – Nr and Np – of the distribution pipes and the house connections, respectively). Four kinds of kernel functions (linear, polynomial, sigmoidal and radial basis functions) were applied. The SVM model based on the linear kernel function was found to be optimal for predicting the failure rate of each kind of water conduit. This model's maximum relative error in predicting failure rates λr and λp during the testing stage amounted to about 4% and 14%, respectively. The average experimental failure rates over the whole analysed period amounted to 0.18 and 0.44 fail./(km·year) for the distribution pipes and the house connections, respectively, and 0.17 and 0.24 fail./(km·year) for the distribution pipes made of PVC and of cast iron, respectively.
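The kernel comparison at the heart of the study can be sketched as below; the predictors, the failure-rate relation, and the data are synthetic placeholders, with only the 75/25 partition kept from the abstract.

```python
# Hedged sketch: SVR with linear, polynomial, sigmoid, and RBF kernels on stand-in data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((70, 2))                       # e.g. conduit length and number of failures
y = 0.1 + 0.3 * X[:, 1] / (X[:, 0] + 0.1)     # placeholder failure-rate relation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel)).fit(X_tr, y_tr)
    rel_err = np.abs(model.predict(X_te) - y_te) / np.abs(y_te)
    print(kernel, f"max relative error = {rel_err.max():.1%}")
```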
APA, Harvard, Vancouver, ISO, and other styles
41

Zhang, Ling, Zixuan Zhang, Zhaohui Xue, and Hao Li. "Sensitive Feature Evaluation for Soil Moisture Retrieval Based on Multi-Source Remote Sensing Data with Few In-Situ Measurements: A Case Study of the Continental U.S." Water 13, no. 15 (July 21, 2021): 2003. http://dx.doi.org/10.3390/w13152003.

Full text
Abstract:
Soil moisture (SM) plays an important role in understanding Earth’s land and near-surface atmosphere interactions. Existing studies have rarely considered using multi-source data, or examined their sensitivity for SM retrieval when few in-situ measurements are available. To address this issue, we designed an SM retrieval method (Multi-MDA-RF) using random forest (RF) based on 29 features derived from passive microwave remote sensing data, optical remote sensing data, land surface models (LSMs), and other auxiliary data. To evaluate the importance of different features to SM retrieval, we first compared 10 filter- or embedded-type feature selection methods with sequential forward selection (SFS). Then, RF was employed to establish a nonlinear relationship between the in-situ SM measurements from sparse network stations and the optimal feature subset. The experiments were conducted in the continental U.S. (CONUS) using in-situ measurements during August 2015, with only 5225 training samples covering the selected feature subset. The experimental results show that mean decrease accuracy (MDA) performs better than the other feature selection methods, and Multi-MDA-RF outperforms the back-propagation neural network (BPNN) and generalized regression neural network (GRNN), with R and unbiased root-mean-square error (ubRMSE) values of 0.93 and 0.032 cm³/cm³, respectively. In comparison with other SM products, Multi-MDA-RF is more accurate and can capture the SM spatial dynamics well.
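Scikit-learn's permutation importance can serve as a stand-in for the mean-decrease-accuracy ranking used above; the sketch below ranks synthetic features with a random forest and refits on the top-ranked subset, without reproducing the paper's 29 specific features or data sources.

```python
# Hedged sketch: MDA-style feature ranking via permutation importance for an RF regressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=29, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
top = imp.importances_mean.argsort()[::-1][:10]        # keep the ten most sensitive features
rf_small = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr[:, top], y_tr)
```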
APA, Harvard, Vancouver, ISO, and other styles
42

Braken, Rebecca, Alexander Paulus, André Pomp, and Tobias Meisen. "An Evaluation of Link Prediction Approaches in Few-Shot Scenarios." Electronics 12, no. 10 (May 19, 2023): 2296. http://dx.doi.org/10.3390/electronics12102296.

Full text
Abstract:
Semantic models are utilized to add context information to datasets and make data accessible and understandable in applications such as dataspaces. Since the creation of such models is a time-consuming task that has to be performed by a human expert, different approaches to automate or support this process exist. A recurring problem is the task of link prediction, i.e., the automatic prediction of links between nodes in a graph, in this case semantic models, usually based on machine learning techniques. While, in general, semantic models are trained and evaluated on large reference datasets, these conditions often do not match the domain-specific real-world applications wherein only a small amount of existing data is available (the cold-start problem). In this study, we evaluated the performance of link prediction algorithms when datasets of a smaller size were used for training (few-shot scenarios). Based on the reported performance evaluation, we first selected algorithms for link prediction and then evaluated the performance of the selected subset using multiple reduced datasets. The results showed that two of the three selected algorithms were suitable for the task of link prediction in few-shot scenarios.
APA, Harvard, Vancouver, ISO, and other styles
43

Solarz, A., R. Thomas, F. M. Montenegro-Montes, M. Gromadzki, E. Donoso, M. Koprowski, L. Wyrzykowski, C. G. Diaz, E. Sani, and M. Bilicki. "Spectroscopic observations of the machine-learning selected anomaly catalogue from the AllWISE Sky Survey." Astronomy & Astrophysics 642 (October 2020): A103. http://dx.doi.org/10.1051/0004-6361/202038439.

Full text
Abstract:
We present the results of a programme to search and identify the nature of unusual sources within the All-sky Wide-field Infrared Survey Explorer (WISE) that is based on a machine-learning algorithm for anomaly detection, namely one-class support vector machines (OCSVM). Designed to detect sources deviating from a training set composed of known classes, this algorithm was used to create a model for the expected data based on WISE objects with spectroscopic identifications in the Sloan Digital Sky Survey. Subsequently, it marked as anomalous those sources whose WISE photometry was shown to be inconsistent with this model. We report the results from optical and near-infrared spectroscopy follow-up observations of a subset of 36 bright (gAB < 19.5) objects marked as “anomalous” by the OCSVM code to verify its performance. Among the observed objects, we identified three main types of sources: (i) low redshift (z ∼ 0.03 − 0.15) galaxies containing large amounts of hot dust (53%), including three Wolf-Rayet galaxies; (ii) broad-line quasi-stellar objects (QSOs) (33%) including low-ionisation broad absorption line (LoBAL) quasars and a rare QSO with strong and narrow ultraviolet iron emission; (iii) Galactic objects in dusty phases of their evolution (3%). The nature of four of these objects (11%) remains undetermined due to low signal-to-noise or featureless spectra. The current data show that the algorithm works well at detecting rare but not necessarily unknown objects among the brightest candidates. They mostly represent peculiar sub-types of otherwise well-known sources. To search for even more unusual sources, a more complete and balanced training set should be created after including these rare sub-species of otherwise abundant source classes, such as LoBALs. Such an iterative approach will ideally bring us closer to improving the strategy design for the detection of rarer sources contained within the vast data store of the AllWISE survey.
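The anomaly-selection step can be illustrated with a one-class SVM in a few lines; the photometric features below are synthetic Gaussians, and nu, the kernel, and the feature count are assumptions rather than the survey's actual configuration.

```python
# Hedged sketch: fit an OCSVM on sources with known classes, flag survey sources outside its support.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
known = rng.normal(0.0, 1.0, size=(5000, 4))     # colours of spectroscopically identified sources
survey = rng.normal(0.0, 1.5, size=(20000, 4))   # full photometric sample to scan

scaler = StandardScaler().fit(known)
ocsvm = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(scaler.transform(known))
anomalous = survey[ocsvm.predict(scaler.transform(survey)) == -1]   # -1 marks outliers
print(len(anomalous), "candidate anomalies")
```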
APA, Harvard, Vancouver, ISO, and other styles
44

Jiang, Bingbing, Xingyu Wu, Kui Yu, and Huanhuan Chen. "Joint Semi-Supervised Feature Selection and Classification through Bayesian Approach." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3983–90. http://dx.doi.org/10.1609/aaai.v33i01.33013983.

Full text
Abstract:
With the increasing data dimensionality, feature selection has become a fundamental task to deal with high-dimensional data. Semi-supervised feature selection focuses on the problem of how to learn a relevant feature subset in the case of abundant unlabeled data with few labeled data. In recent years, many semi-supervised feature selection algorithms have been proposed. However, these algorithms are implemented by separating the processes of feature selection and classifier training, such that they cannot simultaneously select features and learn a classifier with the selected features. Moreover, they ignore the difference of reliability inside unlabeled samples and directly use them in the training stage, which might cause performance degradation. In this paper, we propose a joint semi-supervised feature selection and classification algorithm (JSFS) which adopts a Bayesian approach to automatically select the relevant features and simultaneously learn a classifier. Instead of using all unlabeled samples indiscriminately, JSFS associates each unlabeled sample with a self-adjusting weight to distinguish the difference between them, which can effectively eliminate the irrelevant unlabeled samples via introducing a left-truncated Gaussian prior. Experiments on various datasets demonstrate the effectiveness and superiority of JSFS.
APA, Harvard, Vancouver, ISO, and other styles
45

Turki, Turki, Zhi Wei, and Jason T. L. Wang. "A transfer learning approach via procrustes analysis and mean shift for cancer drug sensitivity prediction." Journal of Bioinformatics and Computational Biology 16, no. 03 (June 2018): 1840014. http://dx.doi.org/10.1142/s0219720018400140.

Full text
Abstract:
Transfer learning (TL) algorithms aim to improve the prediction performance in a target task (e.g. the prediction of cisplatin sensitivity in triple-negative breast cancer patients) via transferring knowledge from auxiliary data of a related task (e.g. the prediction of docetaxel sensitivity in breast cancer patients), where the distribution and even the feature space of the data pertaining to the tasks can be different. In real-world applications, we sometimes have a limited training set in a target task while we have auxiliary data from a related task. To obtain a better prediction performance in the target task, supervised learning requires a sufficiently large training set in the target task to perform well in predicting future test examples of the target task. In this paper, we propose a TL approach for cancer drug sensitivity prediction that combines three techniques. First, we shift the representation of a subset of examples from auxiliary data of a related task to a representation closer to a target training set of a target task. Second, we align the shifted representation of the selected examples of the auxiliary data to the target training set to obtain examples with representation aligned to the target training set. Third, we train machine learning algorithms using both the target training set and the aligned examples. We evaluate the performance of our approach against baseline approaches using the Area Under the receiver operating characteristic (ROC) Curve (AUC) on real clinical trial datasets pertaining to multiple myeloma, non-small cell lung cancer, triple-negative breast cancer, and breast cancer. Experimental results show that our approach is better than the baseline approaches in terms of performance and statistical significance.
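The alignment step can be illustrated with SciPy's orthogonal Procrustes solver; this is a hedged sketch on random matrices, it assumes the auxiliary subset and target set share the same shape for simplicity, and it omits the representation-shifting and model-training stages described above.

```python
# Hedged sketch: rotate selected auxiliary examples toward the target training set, then pool.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
target = rng.random((100, 20))                   # small target training set
auxiliary = rng.random((100, 20))                # selected auxiliary examples (same shape here)

R, _ = orthogonal_procrustes(auxiliary, target)  # rotation minimising ||auxiliary @ R - target||_F
aligned_aux = auxiliary @ R
pooled_X = np.vstack([target, aligned_aux])      # train downstream models on the pooled set
```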
APA, Harvard, Vancouver, ISO, and other styles
46

Zhang, Ying. "Real-Time Detection of Lower Limb Training Stability Function Based on Smart Wearable Sensors." Journal of Sensors 2022 (July 31, 2022): 1–12. http://dx.doi.org/10.1155/2022/7503668.

Full text
Abstract:
Research on smart wearable sensors in limb training has great application significance. To meet real-time detection requirements, this paper proposes a hardware solution for assessing lower limb training stability based on smart wearable sensor theory. To ensure system reliability, the hardware circuit adopts an anti-interference design in three respects: adding decoupling capacitors, optimizing layout and wiring, and grounding the circuit properly; moving-average filtering is also applied to the collected sensor data to remove noise and improve data precision. In the simulations, by analysing the changes in acceleration, angular velocity, and attitude angle under different lower limb training activities and wearing positions, stability features based on combined acceleration, combined angular velocity, and attitude angle were constructed, and statistics such as the stability mean, variance, and attitude angle were extracted. The experimental results show that the 57 extracted feature dimensions were first reduced to 21 by the principal component analysis algorithm, and a wrapper (encapsulation) method then selected an optimal feature subset of 9 dimensions. The proposed multi-feature fusion algorithm achieves higher accuracy, with a maximum improvement of 6.5%, effectively improving the accuracy of the lower limb training stability detection algorithm.
APA, Harvard, Vancouver, ISO, and other styles
47

Hildebrand, J., S. Schulz, R. Richter, and J. Döllner. "SIMULATING LIDAR TO CREATE TRAINING DATA FOR MACHINE LEARNING ON 3D POINT CLOUDS." ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences X-4/W2-2022 (October 14, 2022): 105–12. http://dx.doi.org/10.5194/isprs-annals-x-4-w2-2022-105-2022.

Full text
Abstract:
3D point clouds represent an essential category of geodata used in a variety of geoinformation applications. Typically, these applications require additional semantics to operate on subsets of the data like selected objects or surface categories. Machine learning approaches are increasingly used for classification. They operate directly on 3D point clouds and require large amounts of training data. An adequate amount of high-quality training data is often not available or has to be created manually. In this paper, we introduce a system for virtual laser scanning to create 3D point clouds with semantics information by utilizing 3D models. In particular, our system creates 3D point clouds with the same characteristics regarding density, occlusion, and scan pattern as those 3D point clouds captured in the real world. We evaluate our system with different data sets and show the potential to use the data to train neural networks for 3D point cloud classification.
APA, Harvard, Vancouver, ISO, and other styles
48

Pyenson, Bruce, Maggie Alston, Jeffrey Gomberg, Feng Han, Nikhil Khandelwal, Motoharu Dei, Monica Son, and Jaime Vora. "Applying Machine Learning Techniques to Identify Undiagnosed Patients with Exocrine Pancreatic Insufficiency." Journal of Health Economics and Outcomes Research 6, no. 2 (February 14, 2019): 32–46. http://dx.doi.org/10.36469/9727.

Full text
Abstract:
Background: Exocrine pancreatic insufficiency (EPI) is a serious condition characterized by a lack of functional exocrine pancreatic enzymes and the resultant inability to properly digest nutrients. EPI can be caused by a variety of disorders, including chronic pancreatitis, pancreatic cancer, and celiac disease. EPI remains underdiagnosed because of the nonspecific nature of clinical symptoms, lack of an ideal diagnostic test, and the inability to easily identify affected patients using administrative claims data. Objectives: To develop a machine learning model that identifies patients in a commercial medical claims database who likely have EPI but are undiagnosed. Methods: A machine learning algorithm was developed in Scikit-learn, a Python module. The study population, selected from the 2014 Truven MarketScan® Commercial Claims Database, consisted of patients with EPI-prone conditions. Patients were labeled with 290 condition category flags and split into actual positive EPI cases, actual negative EPI cases, and unlabeled cases. The study population was then randomly divided into a training subset and a testing subset. The training subset was used to determine the performance metrics of 27 models and to select the highest performing model, and the testing subset was used to evaluate performance of the best machine learning model. Results: The study population consisted of 2088 actual positive EPI cases, 1077 actual negative EPI cases, and 437 530 unlabeled cases. In the best performing model, the precision, recall, and accuracy were 0.91, 0.80, and 0.86, respectively. The best-performing model estimated that the number of patients likely to have EPI was about 12 times the number of patients directly identified as EPI-positive through a claims analysis in the study population. The most important features in assigning EPI probability were the presence or absence of diagnosis codes related to pancreatic and digestive conditions. Conclusions: Machine learning techniques demonstrated high predictive power in identifying patients with EPI and could facilitate an enhanced understanding of its etiology and help to identify patients for possible diagnosis and treatment.
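The train/test workflow and the reported metrics can be sketched generically as below; the classifier, class balance, and 290 condition-category features are placeholders rather than the study's model, and the positive-unlabeled aspect of the real problem is not reproduced.

```python
# Hedged sketch: split labelled cases, fit a classifier, report precision, recall, accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=290, n_informative=30,
                           weights=[0.65], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("precision", precision_score(y_te, pred), "recall", recall_score(y_te, pred),
      "accuracy", accuracy_score(y_te, pred))
```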
APA, Harvard, Vancouver, ISO, and other styles
49

Pistoia, Jenny, Nadia Pinardi, Paolo Oddo, Matthew Collins, Gerasimos Korres, and Yann Drillet. "Development of super-ensemble techniques for ocean analyses: the Mediterranean Sea case." Natural Hazards and Earth System Sciences 16, no. 8 (August 9, 2016): 1807–19. http://dx.doi.org/10.5194/nhess-16-1807-2016.

Full text
Abstract:
A super-ensemble methodology is proposed to improve the quality of short-term ocean analyses for sea surface temperature (SST) in the Mediterranean Sea. The methodology consists of a multiple linear regression technique applied to a multi-physics multi-model super-ensemble (MMSE) data set. This is a collection of different operational forecasting analyses together with ad hoc simulations, created by modifying selected numerical model parameterizations. A new linear regression algorithm based on empirical orthogonal function filtering techniques is shown to be efficient in preventing overfitting problems, although the best performance is achieved when a simple spatial filter is applied after the linear regression. Our results show that the MMSE methodology improves the ocean analysis SST estimates with respect to the best ensemble member (BEM) and that the performance is dependent on the selection of an unbiased operator and the length of training. The quality of the MMSE data set has the largest impact on the MMSE analysis root mean square error (RMSE) evaluated with respect to observed satellite SST. The MMSE analysis estimates are also affected by training period length, with the longest period leading to the smoothest estimates. Finally, lower RMSE analysis estimates result from the following: a 15-day training period, an overconfident MMSE data set (a subset with the higher-quality ensemble members) and the least-squares algorithm being filtered a posteriori.
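The core super-ensemble step, a multiple linear regression that weights ensemble members against observations over a training window, can be sketched as follows; the grid, ensemble size, and 15-day window are synthetic stand-ins, and the EOF filtering and a-posteriori spatial filter are omitted.

```python
# Hedged sketch: multi-model super-ensemble SST estimate via multiple linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_days, n_points, n_members = 15, 500, 6                     # 15-day training period
members = rng.random((n_days, n_points, n_members))          # each member's SST analysis
truth = members @ rng.random(n_members) + 0.05 * rng.standard_normal((n_days, n_points))

X = members.reshape(-1, n_members)                           # one row per (day, grid point)
reg = LinearRegression().fit(X, truth.ravel())               # member weights from the training window

new_day = rng.random((n_points, n_members))                  # member analyses for a new day
sst_superensemble = reg.predict(new_day)                     # combined SST estimate
```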
APA, Harvard, Vancouver, ISO, and other styles
50

Maya Gopal, P. S., and R. Bhargavi. "Optimum Feature Subset for Optimizing Crop Yield Prediction Using Filter and Wrapper Approaches." Applied Engineering in Agriculture 35, no. 1 (2019): 9–14. http://dx.doi.org/10.13031/aea.12938.

Full text
Abstract:
In agriculture, crop yield prediction is critical. Crop yield depends on various features which can be categorized as geographical, climatic, and biological. Geographical features consist of cultivable land in hectares, canal length to cover the cultivable land, and the number of tanks and tube wells available for irrigation. Climatic features consist of rainfall, temperature, and radiation. Biological features consist of seeds, minerals, and nutrients. In total, 15 features were considered for this study to understand the impact of features on paddy crop yield for all seasons of each year. For selecting vital features, five filter and wrapper approaches were applied. For assessing the prediction accuracy of the selected features, a Multiple Linear Regression (MLR) model was used. The RMSE, MAE, R, and RRMSE metrics were used to evaluate the performance of the feature selection algorithms. Data used for the analysis were drawn from secondary sources of the state Agriculture Department, Government of Tamil Nadu, India, covering over 30 years. Seventy-five percent of the data was used for training and 25% for testing. Low computational time was also considered in selecting the best feature subset. All feature selection algorithms gave similar results in terms of the RMSE, RRMSE, R, and MAE values, so the adjusted R2 value was used to find the optimum feature subset despite these small deviations. The evaluation of the dataset used in this work shows that the total area of cultivation, the number of tanks and open wells used for irrigation, the length of canals used for irrigation, and the average maximum temperature during the crop season are the best features for crop yield prediction in the study area. The MLR gives 85% model accuracy for the selected features with low computational time. Keywords: Feature selection algorithm, Model validation, Multiple linear regression, Performance metrics.
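The adjusted R2 criterion used to pick the optimum subset is easy to state in code; the sketch below compares a few hypothetical subsets on synthetic data and keeps the one with the highest adjusted R2, which penalises subsets that add features without adding explanatory power.

```python
# Hedged sketch: choose among candidate feature subsets by adjusted R^2 of an MLR fit.
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(model, X, y):
    n, p = X.shape
    r2 = model.score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
X = rng.random((120, 15))                                    # 15 candidate features
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.standard_normal(120)   # synthetic yield response

subsets = {"geographic": [0, 1, 2, 3], "climatic": [4, 5, 6], "combined": [0, 1, 2, 3, 4, 5, 6]}
scores = {name: adjusted_r2(LinearRegression().fit(X[:, cols], y), X[:, cols], y)
          for name, cols in subsets.items()}
print(max(scores, key=scores.get), scores)
```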
APA, Harvard, Vancouver, ISO, and other styles