Journal articles on the topic 'Data imbalance problem'

Consult the top 50 journal articles for your research on the topic 'Data imbalance problem.'

1

Tiwari, Himani. "Improvising Balancing Methods for Classifying Imbalanced Data." International Journal for Research in Applied Science and Engineering Technology 9, no. 9 (September 30, 2021): 1535–43. http://dx.doi.org/10.22214/ijraset.2021.38225.

Abstract:
The class imbalance problem is one of the most challenging problems faced by the machine learning community: imbalance refers to the instances of one class being relatively scarce compared to the other classes. A number of over-sampling and under-sampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the class imbalance issue and examines various balancing methods for dealing with it. To illustrate their differences, an experiment is conducted on multiple simulated data sets, comparing the performance of these sampling methods with different classifiers under various evaluation criteria. In addition, the effect of parameters such as the number of features and the imbalance ratio on classifier performance is evaluated. Keywords: Imbalanced learning, Over-sampling methods, Under-sampling methods, Classifier performance, Evaluation metrics
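The random over- and under-sampling approaches discussed in this abstract can be sketched in a few lines of stand-alone Python (function names are illustrative, not from the paper): duplicate minority samples, or drop majority samples, until the classes are balanced.

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority samples until classes are balanced."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    resampled = majority + minority + extra
    rng.shuffle(resampled)
    Xr, yr = zip(*resampled)
    return list(Xr), list(yr)

def random_undersample(X, y, minority_label, seed=0):
    """Drop majority samples until classes are balanced."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    kept = rng.sample(majority, len(minority))
    resampled = kept + minority
    rng.shuffle(resampled)
    Xr, yr = zip(*resampled)
    return list(Xr), list(yr)

# A toy 1:4 imbalanced dataset: two minority (label 1), eight majority (label 0)
X = [[i] for i in range(10)]
y = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

Xo, yo = random_oversample(X, y, minority_label=1)
Xu, yu = random_undersample(X, y, minority_label=1)
```

Oversampling keeps all ten originals and adds copies; undersampling keeps all minority samples and a random subset of the majority.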
2

Isabella, S. Josephine, Sujatha Srinivasan, and G. Suseendran. "A Framework Using Binary Cross Entropy - Gradient Boost Hybrid Ensemble Classifier for Imbalanced Data Classification." Webology 18, no. 1 (April 29, 2021): 104–20. http://dx.doi.org/10.14704/web/v18i1/web18076.

Abstract:
In the big data era, learning from imbalanced data has become a continuing line of research in data mining and machine learning. Big data and big data analytics have gained prominence as many real-time applications explore large volumes of data, and machine learning offers effective solutions to the difficulties that arise when learning from imbalanced data. Many real-world applications must make predictions on highly imbalanced datasets in which the values of interest of the target variable occur least often, precisely because such rare events matter most to users (for example, stock movements, fraud detection, or network security). The expanding availability of data from networked systems such as security monitoring, internet transactions, financial manipulation, and CCTV surveillance raises the risk of extracting insufficient knowledge from imbalanced data when supporting decision-making processes, so data imbalance remains a challenge for the research field. Data-level and algorithm-level methods are constantly being upgraded, leading to new hybrid frameworks for classification, and classifying imbalanced data remains a challenging task in big data analytics. This study concentrates on the skewed data distributions found in most real-world applications; we analyse the data imbalance and look for a solution.
This paper concentrates mainly on finding a better solution to this problem through a proposed framework using a hybrid ensemble classifier based on binary cross-entropy as the loss function together with the gradient boosting algorithm.
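As a rough illustration of the loss function named in this abstract, the following minimal Python sketch computes binary cross-entropy for a batch of predictions; it shows only the loss itself, not the authors' hybrid gradient-boost ensemble.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE: -[y*log(p) + (1-y)*log(1-p)], with p clipped away
    from 0 and 1 for numeric safety."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Four confident-but-imperfect predictions on labels [1, 0, 1, 0]
loss = binary_cross_entropy([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
```

Under class imbalance this plain BCE is often extended with per-class weights; the paper's framework instead pairs it with gradient boosting.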
3

Yogi, Abhishek, and Ratul Dey. "CLASS IMBALANCE PROBLEM IN DATA SCIENCE: REVIEW." International Research Journal of Computer Science 9, no. 4 (April 30, 2022): 56–60. http://dx.doi.org/10.26562/irjcs.2021.v0904.002.

Abstract:
In the last few years there have been many changes and much evolution in the classification of data. As the application areas of technology expand, the size of data also increases, and classification becomes difficult because of the unbounded size and imbalanced nature of the data. The class imbalance problem has become one of the greatest issues in data mining. An imbalance problem occurs where one of two classes has more samples than the other. Most algorithms focus on classifying the majority samples while ignoring the minority samples, i.e., those that occur rarely but are nonetheless very important. The methods available for classifying imbalanced data sets fall into three categories: algorithmic approaches, feature selection approaches, and data processing approaches, each with its own advantages and disadvantages. This paper presents a systematic study of each approach, pointing out promising directions for research on the class imbalance problem.
4

Rendón, Eréndira, Roberto Alejo, Carlos Castorena, Frank J. Isidro-Ortega, and Everardo E. Granda-Gutiérrez. "Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem." Applied Sciences 10, no. 4 (February 14, 2020): 1276. http://dx.doi.org/10.3390/app10041276.

Abstract:
The class imbalance problem has been a hot topic in the machine learning community in recent years, and in the era of big data and deep learning it remains in force. Much work has addressed the class imbalance problem, with random sampling methods (over- and under-sampling) being the most widely employed approaches. More sophisticated sampling methods have also been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), which has been combined with cleaning techniques such as Edited Nearest Neighbor or Tomek's Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, it is noticeable that the class imbalance problem has been addressed by adapting traditional techniques while relatively ignoring intelligent approaches. This work therefore analyzes the capabilities of heuristic sampling methods applied with deep learning neural networks in the big data domain, with particular attention to cleaning strategies. The study is developed on big, multi-class imbalanced datasets obtained from hyperspectral remote sensing images. A hybrid approach is analyzed in which the dataset is first balanced with SMOTE and used to train an Artificial Neural Network (ANN); the ANN output is then cleaned with ENN to eliminate noise, and the ANN is retrained on the resulting dataset. The results suggest that the best classification outcome is achieved when cleaning strategies are applied to the ANN output rather than only to the input feature space, making clear the need to consider the classifier's nature when classical class imbalance approaches are adapted to deep learning and big data scenarios.
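For reference, the core SMOTE step mentioned in this abstract, generating synthetic minority samples by interpolating toward a nearby minority-class neighbour, can be sketched as follows (a simplified stand-alone version, not the paper's implementation):

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate synthetic minority points: pick a minority sample,
    find its k nearest minority neighbours, and interpolate a random
    distance toward one of them."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Four minority-class points in 2-D; generate four synthetic ones
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]]
new_points = smote_sample(minority)
```

Because each synthetic point lies on a segment between two real minority points, all generated samples stay inside the minority region rather than duplicating existing points.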
5

SUN, YANMIN, ANDREW K. C. WONG, and MOHAMED S. KAMEL. "CLASSIFICATION OF IMBALANCED DATA: A REVIEW." International Journal of Pattern Recognition and Artificial Intelligence 23, no. 04 (June 2009): 687–719. http://dx.doi.org/10.1142/s0218001409007326.

Abstract:
Classification of data with an imbalanced class distribution suffers a significant drawback in the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper reviews the classification of imbalanced data with respect to: the application domains; the nature of the problem; the learning difficulties with standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.
6

Liu, Tian Yu. "Research on Feature Selection for Imbalanced Problem from Fault Diagnosis on Gear." Advanced Materials Research 466-467 (February 2012): 886–90. http://dx.doi.org/10.4028/www.scientific.net/amr.466-467.886.

Abstract:
Defects are one of the important factors resulting in gear faults, so it is significant to study defect diagnosis technology for gears. The class imbalance problem is encountered in fault diagnosis and has a seriously negative effect on the performance of classifiers that assume a balanced distribution of classes. Though the problem is critical, few previous works have paid attention to class imbalance in gear fault diagnosis. In imbalanced problems, some features are redundant or even irrelevant, and these features hurt the generalization performance of learning machines. Here we propose PREE (Prediction Risk based feature selection for EasyEnsemble) to solve the class imbalance problem in gear fault diagnosis. Experimental results on UCI data sets and a gear data set show that PREE improves classification performance and prediction ability on imbalanced datasets.
7

Khoshgoftaar, Taghi M., Naeem Seliya, and Dennis J. Drown. "Evolutionary data analysis for the class imbalance problem." Intelligent Data Analysis 14, no. 1 (January 22, 2010): 69–88. http://dx.doi.org/10.3233/ida-2010-0409.

8

Hartono, Opim Salim Sitompul, Erna Budhiarti Nababan, Tulus, Dahlan Abdullah, and Ansari Saleh Ahmar. "A New Diversity Technique for Imbalance Learning Ensembles." International Journal of Engineering & Technology 7, no. 2.14 (April 8, 2018): 478. http://dx.doi.org/10.14419/ijet.v7i2.11251.

Abstract:
Data mining and machine learning techniques designed to solve classification problems require a balanced class distribution. In reality, however, datasets sometimes contain a class represented by a large number of instances alongside classes with far fewer instances. This is known as the class imbalance problem. Classifier ensembles are a method often used to overcome class imbalance, and data diversity is one of the cornerstones of ensembles: an ideal ensemble system should have accurate individual classifiers whose errors, when they occur, fall on different objects or instances. This research presents an overview and experimental study using the Hybrid Approach Redefinition (HAR) method for handling class imbalance, which is also expected to yield better data diversity. The experiments use six datasets with different imbalance ratios and compare HAR with SMOTEBoost, a re-weighting method often used for handling class imbalance. This study shows that data diversity is related to performance in imbalance learning ensembles and that the proposed method obtains better data diversity.
9

Naboureh, Amin, Ainong Li, Jinhu Bian, Guangbin Lei, and Meisam Amani. "A Hybrid Data Balancing Method for Classification of Imbalanced Training Data within Google Earth Engine: Case Studies from Mountainous Regions." Remote Sensing 12, no. 20 (October 11, 2020): 3301. http://dx.doi.org/10.3390/rs12203301.

Abstract:
The distribution of Land Cover (LC) classes is mostly imbalanced in mountainous areas, with some majority LC classes dominating the minority classes. Although standard Machine Learning (ML) classifiers can achieve high accuracies for the majority classes, they largely fail to provide reasonable accuracies for the minority classes, mainly due to the class imbalance problem. In this study, a hybrid data balancing method, called Partial Random Over-Sampling and Random Under-Sampling (PROSRUS), was proposed to resolve the class imbalance issue. Unlike most data balancing techniques, which seek to fully balance datasets, PROSRUS uses a partial balancing approach with hundreds of fractions for the majority and minority classes. Time-series of Landsat-8 imagery and SRTM topographic data, along with various spectral indices, were used over three mountainous sites within the Google Earth Engine (GEE) cloud platform. PROSRUS performed better than several other balancing methods and increased the accuracy of the minority classes without reducing overall classification accuracy. Furthermore, adopting complementary information, particularly topographic data, considerably increased the accuracy of minority classes in mountainous areas. Finally, the results indicated that every imbalanced dataset requires a specific fraction (or fractions) for addressing the class imbalance problem, because different datasets have different characteristics.
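The partial balancing idea behind PROSRUS, which moves class counts only part of the way toward full balance rather than equalizing them, can be illustrated with a generic sketch (the `fraction` parameter and function name are hypothetical illustrations, not taken from the paper):

```python
import random

def partial_balance(X, y, minority_label, fraction=0.5, seed=0):
    """Move class counts only part-way toward balance: with
    fraction=0.5, the minority is oversampled and the majority
    undersampled by half the gap to the balanced midpoint."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    shift = int(fraction * (len(majority) - len(minority)) / 2)
    extra = [rng.choice(minority) for _ in range(shift)]   # grow minority
    kept = rng.sample(majority, len(majority) - shift)     # shrink majority
    Xr, yr = zip(*(minority + extra + kept))
    return list(Xr), list(yr)

# 2 minority vs 10 majority; fraction=0.5 yields a 4 vs 8 split
X = [[i] for i in range(12)]
y = [1, 1] + [0] * 10
Xr, yr = partial_balance(X, y, minority_label=1, fraction=0.5)
```

The abstract's finding that every dataset needs its own fraction corresponds here to tuning `fraction` per dataset instead of always forcing a 1:1 ratio.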
10

Liu, Zhenyan, Yifei Zeng, Pengfei Zhang, Jingfeng Xue, Ji Zhang, and Jiangtao Liu. "An Imbalanced Malicious Domains Detection Method Based on Passive DNS Traffic Analysis." Security and Communication Networks 2018 (June 20, 2018): 1–7. http://dx.doi.org/10.1155/2018/6510381.

Abstract:
Although existing malicious domain detection techniques have shown great success in many real-world applications, the problem of learning from imbalanced data has rarely been considered to this day. Yet actual DNS traffic is inherently imbalanced, so how to build malicious domain detection models oriented to imbalanced data is a very important issue worthy of study. This paper proposes a novel imbalanced malicious domain detection method based on passive DNS traffic analysis, which can effectively deal with both the between-class and the within-class imbalance problems. Experiments show that the proposed method performs favorably compared to existing algorithms.
11

Peng, Ching-Tung, Yung-Kuan Chan, and Shyr-Shen Yu. "Data Imbalance Immunity Bone Age Assessment System Using Independent Autoencoders." Applied Sciences 12, no. 16 (August 9, 2022): 7974. http://dx.doi.org/10.3390/app12167974.

Abstract:
Bone age assessment (BAA) is an important indicator of child maturity. Bone age is mostly evaluated during the puberty stage, so bone age data for that stage are much easier to obtain than for the toddler and post-puberty stages, and the amounts of data collected for those stages are often much smaller. This so-called data imbalance problem affects prediction accuracy. To deal with it, this paper proposes a data imbalance immunity bone age assessment (DIIBAA) system consisting of two branches. The first branch comprises CNN-based autoencoders and a CNN-based scoring network: three autoencoders are built for the bone age data of the toddler, puberty, and post-puberty stages, respectively. Since the three autoencoders do not interfere with each other, there is no data imbalance problem in this branch. The outputs of the three autoencoders are then fed into the scoring network, and the autoencoder that produces the image with the highest score is taken as the final prediction. In the experiments, imbalanced training data with a positive-to-negative sample ratio of 1:2 are used, which is alleviated compared to the original highly imbalanced data. Because the scoring network converts the classification problem into an image quality scoring problem, it does not use the classification features of the image; therefore, in the second branch, classification features are also added to the DIIBAA system, which then considers both image quality features and classification features. Finally, DenseNet169-based autoencoders are employed in the experiments, and the obtained evaluation accuracies improve on the baseline network.
12

Wongvorachan, Tarid, Surina He, and Okan Bulut. "A Comparison of Undersampling, Oversampling, and SMOTE Methods for Dealing with Imbalanced Classification in Educational Data Mining." Information 14, no. 1 (January 16, 2023): 54. http://dx.doi.org/10.3390/info14010054.

Abstract:
Educational data mining is capable of producing useful data-driven applications (e.g., early warning systems in schools or the prediction of students’ academic achievement) based on predictive models. However, the class imbalance problem in educational datasets could hamper the accuracy of predictive models as many of these models are designed on the assumption that the predicted class is balanced. Although previous studies proposed several methods to deal with the imbalanced class problem, most of them focused on the technical details of how to improve each technique, while only a few focused on the application aspect, especially for the application of data with different imbalance ratios. In this study, we compared several sampling techniques to handle the different ratios of the class imbalance problem (i.e., moderately or extremely imbalanced classifications) using the High School Longitudinal Study of 2009 dataset. For our comparison, we used random oversampling (ROS), random undersampling (RUS), and the combination of the synthetic minority oversampling technique for nominal and continuous (SMOTE-NC) and RUS as a hybrid resampling technique. We used the Random Forest as our classification algorithm to evaluate the results of each sampling technique. Our results show that random oversampling for moderately imbalanced data and hybrid resampling for extremely imbalanced data seem to work best. The implications for educational data mining applications and suggestions for future research are discussed.
13

Lu, Yang, Yiu-Ming Cheung, and Yuan Yan Tang. "Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem." IEEE Transactions on Neural Networks and Learning Systems 31, no. 9 (September 2020): 3525–39. http://dx.doi.org/10.1109/tnnls.2019.2944962.

14

Alfhaid, Mashaal A., and Manal Abdullah. "Classification of Imbalanced Data Stream: Techniques and Challenges." Transactions on Machine Learning and Artificial Intelligence 9, no. 2 (April 23, 2021): 36–52. http://dx.doi.org/10.14738/tmlai.92.9964.

Abstract:
As the volume of generated data increases every day, data mining and knowledge extraction have grown in importance. In traditional data mining, knowledge extraction can be performed offline; dealing with stream data mining is different because data arrive continuously, can be processed in only a single scan, and are subject to concept drift. Since the pre-processing stage is critical in knowledge extraction, imbalanced stream data have gained significant attention among researchers in the last few years. Many real-world applications suffer from class imbalance, including medicine, business, and fraud detection. Supervised learning involves classes, whether binary or multi-class, and these classes are often imbalanced, divided into a majority (negative) class and a minority (positive) class; the result can be a bias toward the majority class that skews the predictive performance of models. Handling imbalanced streaming data is mandatory for more accurate and reliable learning models. In this paper, we present an overview of data stream mining and its tools, summarize the class imbalance problem and its different approaches, and present the popular evaluation metrics and the challenges arising from imbalanced streaming data.
15

Malhotra, Ruchika, and Kusum Lata. "Using Ensembles for Class-Imbalance Problem to Predict Maintainability of Open Source Software." International Journal of Reliability, Quality and Safety Engineering 27, no. 05 (March 6, 2020): 2040011. http://dx.doi.org/10.1142/s0218539320400112.

Abstract:
To facilitate software maintenance and save maintenance costs, numerous machine learning (ML) techniques have been studied for predicting the maintainability of software modules or classes. The research community has put considerable effort into developing software maintainability prediction (SMP) models that relate software metrics to the maintainability of modules or classes. When software classes demanding high maintainability effort (HME) are fewer than low maintainability effort (LME) classes, the situation leads to imbalanced datasets for training SMP models. The imbalanced class distribution in SMP datasets can be a dilemma for various ML techniques because minority class instances are either misclassified or discarded as noise. Recent developments in predictive modeling have ascertained that ensemble techniques can boost the performance of ML techniques by collating their predictions, yet ensembles by themselves do little to solve the class imbalance problem. However, combining ensemble techniques with techniques that handle class imbalance (e.g., data resampling) has led to several proposals in research. This paper evaluates the performance of ensembles for class imbalance in the domain of SMP. Ensembles for the class imbalance problem (ECIP) are modified ensembles that pre-process the imbalanced data using data resampling before the learning process. The study experimentally compares the performance of several ECIP using the performance metrics Balance and g-Mean over eight Apache software datasets. The results advocate that, for imbalanced datasets, ECIP improve the performance of SMP models compared to classic ensembles.
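The g-Mean metric mentioned in this abstract is the geometric mean of the per-class recalls, so it is high only when both classes are recognised well; a minimal sketch for the binary case:

```python
def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (recall on the positive class)
    and specificity (recall on the negative class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return (sens * spec) ** 0.5

# A classifier that always predicts the majority class
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
score = g_mean(y_true, y_pred)
```

A majority-only predictor scores a g-Mean of 0 even though its plain accuracy on this toy set is 80%, which is exactly why the metric suits imbalanced data.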
16

Obrubova, V., and M. Ozerova. "IMBALANCE OF CLASSES IN SOLVING THE PROBLEM OF SOCIAL NETWORKS USER CLASSIFICATION FOR PROFESSIONAL ORIENTATION." National Association of Scientists 2, no. 68 (July 10, 2021): 41–43. http://dx.doi.org/10.31618/nas.2413-5291.2021.2.68.449.

Abstract:
The problem of data imbalance is often underestimated when solving classification problems. A classification model that appears well trained on the data and gives a good recognition rate may nevertheless be unreliable. Examining this problem in the specific task of classifying social network users makes it possible to understand how, why, and, most importantly, when it is necessary to eliminate data imbalance.
17

Terzi, Duygu Sinanc, and Seref Sagiroglu. "A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem." Applied Computer Systems 24, no. 2 (December 1, 2019): 104–10. http://dx.doi.org/10.2478/acss-2019-0013.

Abstract:
The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims to modify the existing dataset to increase classification success. DIBID was implemented on public datasets under two strategies: the first presents the success of the model on datasets with different imbalance ratios, and the second compares the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed the other imbalanced big data solutions and increased area-under-the-curve values by between 10% and 24% in the case study.
18

Gong, Chunlin, and Liangxian Gu. "A Novel SMOTE-Based Classification Approach to Online Data Imbalance Problem." Mathematical Problems in Engineering 2016 (2016): 1–14. http://dx.doi.org/10.1155/2016/5685970.

Abstract:
In many practical engineering applications, data are usually collected in an online fashion. If the classes of these data are severely imbalanced, however, classification performance is restricted. In this paper, a novel classification approach is proposed to solve the online data imbalance problem by integrating a fast and efficient learning algorithm, the Extreme Learning Machine (ELM), with a typical sampling strategy, the synthetic minority oversampling technique (SMOTE). To reduce the severe imbalance, a granulation division of the major-class samples is made according to the samples' distribution characteristics, and the original samples are replaced by the obtained granule cores to prepare a balanced sample set. In the online stage, a granulation division is first made for the minor class, and oversampling with SMOTE is then conducted in the region around the granule core and granule border. The training sample set is thus gradually balanced and the online ELM model is dynamically updated. Fuzzy information entropy is also introduced theoretically to prove that the proposed approach has a lower bound on model reliability after undersampling. Numerical experiments on two different kinds of datasets demonstrate that the proposed approach outperforms some state-of-the-art methods in terms of generalization performance and numerical stability.
19

Li, Peng, Lili Yin, Bo Zhao, and Yuezhongyi Sun. "Virtual Screening of Drug Proteins Based on Imbalance Data Mining." Mathematical Problems in Engineering 2021 (May 22, 2021): 1–10. http://dx.doi.org/10.1155/2021/5585990.

Abstract:
To address the imbalanced data problem in molecular docking-based virtual screening, this paper proposes a virtual screening method for drug proteins based on imbalanced data mining, introducing machine learning into the virtual screening of drug proteins to handle the imbalanced data and improve screening accuracy. First, to address the imbalance caused by the large difference between the numbers of active and inactive compounds in the docking conformations generated during virtual screening, the paper proposes to mitigate the imbalance using SMOTE combined with a genetic algorithm, artificially synthesizing new active compounds by upsampling the active class. Then, to improve accuracy during the virtual screening of drug proteins, the idea of ensemble learning is introduced: random forest (RF), an extension of the Bagging ensemble technique, is combined with the support vector machine (SVM), and virtual screening of molecular docking conformations with this RF-SVM technique is proposed to improve the prediction accuracy for active compounds in docking conformations. To verify the effectiveness of the proposed technique, HIV-1 protease and SRC kinase were first used as test data, and CA II was then used to validate the model. On the test dataset, virtual screening with the proposed method improved both the enrichment factor (EF) and AUC compared with traditional virtual screening, showing that the proposed method can effectively improve the accuracy of virtual drug screening.
20

Cohen, David, and Shannon Hughes. "How Do People Taking Psychiatric Drugs Explain Their “Chemical Imbalance?”." Ethical Human Psychology and Psychiatry 13, no. 3 (2012): 176–89. http://dx.doi.org/10.1891/1559-4343.13.3.176.

Abstract:
Many people believe that chemical imbalances cause mental illnesses, despite the absence of evidence to ascertain this. This study describes the reasoning that people use in their own case to justify this belief. Data come from recorded medication histories with 22 adults aged 23–68 years, taking different psychiatric drugs for various problems and varying durations, asked directly if they thought their problem was caused by a chemical imbalance and to explain their answer. About two-thirds expressed belief that they had a chemical imbalance; and the rest that they did not have one, did not or could not know, or that their medication had caused one. Reasoning backward from positive drug experiences (ex juvantibus or post hoc) and appeals to authority and convention characterized most answers expressing belief in an imbalance. Experiencing improvement while taking drugs and acquiescing in mental health practitioners’ views instills or reinforces people’s belief that they are or were chemically imbalanced, which suggests viewing the belief as a drug effect. The chemical imbalance notion is likely to persist, as its appeal to give personal meaning to symptom relief and its unfalsifiability ensure institutional support that neutralizes the absence of scientific support.
21

Quan, Daying, Wei Feng, Gabriel Dauphin, Xiaofeng Wang, Wenjiang Huang, and Mengdao Xing. "A Novel Double Ensemble Algorithm for the Classification of Multi-Class Imbalanced Hyperspectral Data." Remote Sensing 14, no. 15 (August 5, 2022): 3765. http://dx.doi.org/10.3390/rs14153765.

Abstract:
The class imbalance problem has been reported in remote sensing and hinders the classification performance of many machine learning algorithms. Several technologies, such as data sampling methods, feature selection-based methods, and ensemble-based methods, have been proposed to solve it; however, these methods suffer from the loss of useful information, from artificial noise, or from overfitting. In this paper, a novel double ensemble algorithm is proposed to deal with the multi-class imbalance problem in hyperspectral images. The method first computes feature importance values for the hyperspectral data via an ensemble model, then produces several balanced data sets based on oversampling and builds a number of classifiers, and finally combines the classification results of these diverse classifiers according to a specific ensemble rule. In the experiments, different data-handling and classification methods, including random undersampling (RUS), random oversampling (ROS), AdaBoost, Bagging, and random forest, are compared with the proposed double random forest method. Experimental results on three imbalanced hyperspectral data sets demonstrate the effectiveness of the proposed algorithm.
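The resampling-then-ensemble stage described in this abstract, building several balanced data sets and training one classifier per set before combining their votes, can be sketched generically (names are illustrative; the paper's double ensemble additionally uses feature importance and its own combination rule):

```python
import random

def balanced_subsets_oversampled(X, y, minority_label, n_subsets=3, seed=0):
    """Build several balanced training sets; each keeps all majority
    samples and oversamples the minority (with replacement) to match,
    so every base classifier of the ensemble sees a balanced view."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    subsets = []
    for _ in range(n_subsets):
        boosted = [rng.choice(minority) for _ in range(len(majority))]
        subset = majority + boosted
        rng.shuffle(subset)
        subsets.append(subset)
    return subsets

# 3 minority vs 9 majority samples; build three balanced training sets
X = [[i] for i in range(12)]
y = [1, 1, 1] + [0] * 9
sets = balanced_subsets_oversampled(X, y, minority_label=1)
```

Training one classifier per subset and voting over their predictions gives the diversity that such ensemble methods rely on, since each subset sees a different random replication of the minority class.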
APA, Harvard, Vancouver, ISO, and other styles
22

Sainin, Mohd Shamrie, Rayner Alfred, and Faudziah Ahmad. "ENSEMBLE META CLASSIFIER WITH SAMPLING AND FEATURE SELECTION FOR DATA WITH IMBALANCE MULTICLASS PROBLEM." Journal of Information and Communication Technology 20, no. 2 (February 21, 2021): 103–33. http://dx.doi.org/10.32890/jict2021.20.2.1.

Full text
Abstract:
Ensemble learning by combining several single classifiers or another ensemble classifier is one of the procedures to solve the imbalance problem in multiclass data. However, this approach still faces the question of how the ensemble methods obtain their higher performance. In this paper, an investigation was carried out on the design of a meta classifier ensemble with sampling and feature selection for multiclass imbalanced data. The specific objectives were: 1) to improve the ensemble classifier through a data-level approach (sampling and feature selection); 2) to perform experiments on sampling, feature selection, and the ensemble classifier model; and 3) to evaluate the performance of the ensemble classifier. To fulfil the objectives, a preliminary data collection of Malaysian plants’ leaf images was prepared and experimented on, and the results were compared. The ensemble design was also tested with three other high-imbalance-ratio benchmark datasets. It was found that the design using sampling, feature selection, and an ensemble classifier method via AdaboostM1 with random forest (also an ensemble classifier) provided improved performance throughout the investigation. The result of this study is important to the ongoing problem of multiclass imbalance, where a specific structure and its performance can be improved in terms of processing time and accuracy.
APA, Harvard, Vancouver, ISO, and other styles
23

Li, Yan Ling, Kui Xia Han, and Ye Hang Zhu. "The Influence of Data Imbalance on Feature Selection." Advanced Materials Research 562-564 (August 2012): 1634–37. http://dx.doi.org/10.4028/www.scientific.net/amr.562-564.1634.

Full text
Abstract:
The data imbalance problem is a pressing problem in the data mining and machine learning fields: a standard classifier tends to over-adapt to the large categories and ignore the small ones. Taking binary text classification as the application background, this paper compares the impact of imbalanced data distributions on different feature selection methods from two perspectives: the number of texts and the text length. On this basis, the authors attempt to improve the feature selection methods by setting thresholds per category; a series of experiments and result analyses yields some practical conclusions.
APA, Harvard, Vancouver, ISO, and other styles
24

Choudhary, Roshani, and Sanyam Shukla. "Reduced-Kernel Weighted Extreme Learning Machine Using Universum Data in Feature Space (RKWELM-UFS) to Handle Binary Class Imbalanced Dataset Classification." Symmetry 14, no. 2 (February 14, 2022): 379. http://dx.doi.org/10.3390/sym14020379.

Full text
Abstract:
Class imbalance is a phenomenon of asymmetry that degrades the performance of traditional classification algorithms such as the Support Vector Machine (SVM) and Extreme Learning Machine (ELM). Various modifications of SVM and ELM have been proposed to handle the class imbalance problem, which focus on different aspects to resolve the class imbalance. The Universum Support Vector Machine (USVM) incorporates the prior information in the classification model by adding Universum data to the training data to handle the class imbalance problem. Various other modifications of SVM have been proposed which use Universum data in the classification model generation. Moreover, the existing ELM-based classification models intended to handle class imbalance do not consider the prior information about the data distribution for training. An ELM-based classification model creates two symmetry planes, one for each class. The Universum-based ELM classification model tries to create a third plane between the two symmetric planes using Universum data. This paper proposes a novel hybrid framework called Reduced-Kernel Weighted Extreme Learning Machine Using Universum Data in Feature Space (RKWELM-UFS) to handle the classification of binary class-imbalanced problems. The proposed RKWELM-UFS combines the Universum learning method with a Reduced-Kernelized Weighted Extreme Learning Machine (RKWELM) for the first time to inherit the advantages of both techniques. To generate efficient Universum samples in the feature space, this work uses the kernel trick. The performance of the proposed method is evaluated using 44 benchmark binary class-imbalanced datasets. The proposed method is compared with 10 state-of-the-art classifiers using AUC and G-mean. The statistical t-test and Wilcoxon signed-rank test are used to quantify the performance enhancement of the proposed RKWELM-UFS compared to other evaluated classifiers.
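A common way to build Universum examples, which this line of work builds on, is to average one sample from each class so the resulting points belong to neither class but lie between them. The sketch below shows this input-space averaging idea only; the proposed RKWELM-UFS generates its Universum samples in the kernel feature space, which this simplified illustration does not attempt.

```python
import random

def universum_by_averaging(pos, neg, n, rng):
    """Create Universum points by averaging one sample from each class.
    Such points encode prior information about the region the decision
    boundary should pass through."""
    return [tuple((a + b) / 2 for a, b in zip(rng.choice(pos), rng.choice(neg)))
            for _ in range(n)]

rng = random.Random(0)
pos = [(1.0, 1.0), (1.2, 0.8)]   # hypothetical positive-class samples
neg = [(3.0, 5.0), (2.8, 5.2)]   # hypothetical negative-class samples
universum = universum_by_averaging(pos, neg, n=4, rng=rng)
```

The generated points cluster between the two classes and can be appended to the training set as a third, "belongs to neither" category.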
APA, Harvard, Vancouver, ISO, and other styles
25

Kurniawati, Yulia Ery, and Yulius Denny Prabowo. "Model optimisation of class imbalanced learning using ensemble classifier on over-sampling data." IAES International Journal of Artificial Intelligence (IJ-AI) 11, no. 1 (March 1, 2022): 276. http://dx.doi.org/10.11591/ijai.v11.i1.pp276-283.

Full text
Abstract:
Data imbalance is one of the problems in the application of machine learning and data mining. Often this imbalance occurs in the most essential and needed case entities. Two approaches to overcoming this problem are the data-level approach and the algorithm approach. This study aims to obtain the best model for a pap smear dataset by combining a data-level approach with an algorithmic approach to solve the data imbalance. Laboratory datasets are often small and imbalanced, and in almost every case the minority entities are the most important and needed ones. The over-sampling methods used in this study as the data-level approach are the synthetic minority oversampling technique-nominal (SMOTE-N) and adaptive synthetic-nominal (ADASYN-N) algorithms. The algorithm approach used is an ensemble classifier, AdaBoost and bagging, with the classification and regression tree (CART) as the base learner. The best model, in terms of accuracy, precision, recall, and f-measure, was obtained from the experiments using ADASYN-N and AdaBoost-CART.
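For nominal (categorical) features, SMOTE-N-style synthesis cannot interpolate numerically; instead a new sample takes, per feature, the most common value among a minority seed and its nearest minority neighbours. The sketch below is a simplification: real SMOTE-N uses the value difference metric rather than the plain Hamming distance used here, and imbalanced-learn's `SMOTEN` implements the full algorithm.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of feature positions where two nominal samples differ."""
    return sum(x != y for x, y in zip(a, b))

def smoten_like(minority, k, n_new, rng):
    """Simplified SMOTE-N: each synthetic sample takes, per feature, the
    mode among a random minority seed and its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        seed = minority[i]
        neighbours = sorted((minority[j] for j in range(len(minority)) if j != i),
                            key=lambda s: hamming(seed, s))[:k]
        group = [seed] + neighbours
        synthetic.append(tuple(Counter(col).most_common(1)[0][0]
                               for col in zip(*group)))
    return synthetic

rng = random.Random(42)
minority = [("red", "yes"), ("red", "no"), ("red", "yes"), ("blue", "yes")]
new = smoten_like(minority, k=2, n_new=3, rng=rng)
```

Because every synthetic value is a mode of observed values, the generated samples never contain categories absent from the minority class.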
APA, Harvard, Vancouver, ISO, and other styles
26

Alahmari, Fahad. "A Comparison of Resampling Techniques for Medical Data Using Machine Learning." Journal of Information & Knowledge Management 19, no. 01 (March 2020): 2040016. http://dx.doi.org/10.1142/s021964922040016x.

Full text
Abstract:
Data imbalance with respect to the class labels has been recognised as a challenging problem for machine learning techniques as it has a direct impact on the classification model’s performance. In an imbalanced dataset, most of the instances belong to one class, while far fewer instances are associated with the remaining classes. Most of the machine learning algorithms tend to favour the majority class and ignore the minority classes leading to classification models being generated that cannot be generalised. This paper investigates the problem of class imbalance for a medical application related to autism spectrum disorder (ASD) screening to identify the ideal data resampling method that can stabilise classification performance. To achieve the aim, experimental analyses to measure the performance of different oversampling and under-sampling techniques have been conducted on a real imbalanced ASD dataset related to adults. The results produced by multiple classifiers on the considered datasets showed superiority in terms of specificity, sensitivity, and precision, among others, when adopting oversampling techniques in the pre-processing phase.
APA, Harvard, Vancouver, ISO, and other styles
27

Narwane, Swati V., and Sudhir D. Sawarkar. "Effects of Class Imbalance Using Machine Learning Algorithms." International Journal of Applied Evolutionary Computation 12, no. 1 (January 2021): 1–17. http://dx.doi.org/10.4018/ijaec.2021010101.

Full text
Abstract:
Class imbalance is a major hurdle for machine learning-based systems. The data set is the backbone of machine learning and must be studied to handle class imbalance. The purpose of this paper is to investigate the effect of class imbalance on data sets. The proposed methodology determines model accuracy for each class distribution. To find possible solutions, the behaviour of an imbalanced data set was investigated. The study considers two case studies, with the data set varied from a balanced to an unbalanced class distribution. The data set was tested with training and test data for standard machine learning algorithms. Model accuracy for each class distribution was measured with the training data set. Further, the built model was tested on each individual binary class. Results show that, to improve system performance, it is essential to address class imbalance problems. The study concludes that the system produces biased results due to the majority class. In the future, the multiclass imbalance problem can be studied using advanced algorithms.
APA, Harvard, Vancouver, ISO, and other styles
28

Guo, Huaping, Jun Zhou, Chang-an Wu, and Wei She. "A Novel Hybrid-Based Ensemble for Class Imbalance Problem." International Journal on Artificial Intelligence Tools 27, no. 06 (September 2018): 1850025. http://dx.doi.org/10.1142/s0218213018500252.

Full text
Abstract:
Class imbalance is very common in the real world. However, conventional advanced methods do not work well on imbalanced data due to the skewed class distribution. This paper proposes a simple but effective Hybrid-based Ensemble (HE) to deal with the two-class imbalance problem. HE learns a hybrid ensemble in two stages: (1) learning several projection matrices from the rebalanced data obtained by under-sampling the original training set, and constructing new training sets by projecting the original training set into the different spaces defined by the matrices; and (2) under-sampling several subsets from each new training set and training a model on each subset. Here, feature projection aims to improve the diversity between ensemble members, and the under-sampling technique improves the generalization ability of individual members on the minority class. Experimental results show that, compared with other state-of-the-art methods, HE performs significantly better on the AUC, G-mean, F-measure, and recall measures.
APA, Harvard, Vancouver, ISO, and other styles
29

Wu, Xu, Youlong Yang, and Lingyu Ren. "Entropy difference and kernel-based oversampling technique for imbalanced data learning." Intelligent Data Analysis 24, no. 6 (December 18, 2020): 1239–55. http://dx.doi.org/10.3233/ida-194761.

Full text
Abstract:
Class imbalance is often a problem in various real-world datasets, where one class contains a small number of instances and the other contains a large number. It is notably difficult to develop an effective model using traditional data mining and machine learning algorithms without data preprocessing techniques to balance the dataset. Oversampling is often used as a pretreatment method for imbalanced datasets. Specifically, synthetic oversampling techniques focus on balancing the number of training instances between the majority class and the minority class by generating extra artificial minority-class instances. However, current oversampling techniques simply consider the imbalance of quantity and pay no attention to whether the distribution is balanced or not. Therefore, this paper proposes an entropy difference and kernel-based SMOTE (EDKS), which considers the imbalance degree of a dataset from its distribution via entropy difference and overcomes the limitation of SMOTE for nonlinear problems by oversampling in the feature space of a support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in kernel space, determines the majority class and minority class, and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating its retention capability. The algorithm can effectively distinguish datasets with the same imbalance ratio but different distributions. The experimental study evaluates and compares the performance of our method against state-of-the-art algorithms, and demonstrates that the proposed approach is competitive with the state-of-the-art algorithms on multiple benchmark imbalanced datasets.
APA, Harvard, Vancouver, ISO, and other styles
30

Lin, Ismael, Octavio Loyola-González, Raúl Monroy, and Miguel Angel Medina-Pérez. "A Review of Fuzzy and Pattern-Based Approaches for Class Imbalance Problems." Applied Sciences 11, no. 14 (July 8, 2021): 6310. http://dx.doi.org/10.3390/app11146310.

Full text
Abstract:
The usage of imbalanced databases is a recurrent problem in real-world data such as medical diagnostic, fraud detection, and pattern recognition. Nevertheless, in class imbalance problems, the classifiers are commonly biased by the class with more objects (majority class) and ignore the class with fewer objects (minority class). There are different ways to solve the class imbalance problem, and there has been a trend towards the usage of patterns and fuzzy approaches due to the favorable results. In this paper, we provide an in-depth review of popular methods for imbalanced databases related to patterns and fuzzy approaches. The reviewed papers include classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used. Finally, we suggest further research directions according to the analysis of the reviewed papers and the trend of the state of the art.
APA, Harvard, Vancouver, ISO, and other styles
31

Lango, Mateusz. "Tackling the Problem of Class Imbalance in Multi-class Sentiment Classification: An Experimental Study." Foundations of Computing and Decision Sciences 44, no. 2 (June 1, 2019): 151–78. http://dx.doi.org/10.2478/fcds-2019-0009.

Full text
Abstract:
Sentiment classification is an important task which has gained extensive attention both in academia and in industry. Many issues related to this task, such as the handling of negation or of sarcastic utterances, were analyzed and accordingly addressed in previous works. However, the issue of class imbalance, which often compromises the prediction capabilities of learning algorithms, was scarcely studied. In this work, we aim to bridge the gap between imbalanced learning and sentiment analysis. An experimental study including twelve imbalanced learning preprocessing methods, four feature representations, and a dozen datasets is carried out in order to analyze the usefulness of imbalanced learning methods for sentiment classification. Moreover, the data difficulty factors commonly studied in imbalanced learning are investigated on sentiment corpora to evaluate the impact of class imbalance.
APA, Harvard, Vancouver, ISO, and other styles
32

Yan, Jianhong, and Suqing Han. "Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method." Mathematical Problems in Engineering 2018 (2018): 1–13. http://dx.doi.org/10.1155/2018/5036710.

Full text
Abstract:
Learning with imbalanced data sets is considered one of the key topics in the machine learning community. Stacking is an efficient ensemble algorithm for balanced data sets; however, stacking ensembles have seldom been applied to imbalanced data. In this paper, we propose a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on a 2-layer learning model. The first step is Level 0 model generalization, including data preprocessing and base model training. The second step is Level 1 model generalization, involving a cost-sensitive classifier and a logistic regression algorithm. In the learning phase, preprocessing techniques can be embedded in imbalanced-data learning methods. In the cost-sensitive algorithm, the cost matrix is combined with both data characteristics and algorithms. In the RECSG method, the ensemble algorithm is combined with imbalanced-data techniques. According to the experimental results obtained on 17 public imbalanced data sets, as indicated by various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed better classification performance than other ensemble and single algorithms. The proposed method is especially effective when the performance of the base classifier is low. All this demonstrates that the proposed method can be applied to the class imbalance problem.
APA, Harvard, Vancouver, ISO, and other styles
33

Emamipour, Sajad, Rasoul Sali, and Zahra Yousefi. "A Multi-Objective Ensemble Method for Class Imbalance Learning." International Journal of Big Data and Analytics in Healthcare 2, no. 1 (January 2017): 16–34. http://dx.doi.org/10.4018/ijbdah.2017010102.

Full text
Abstract:
This article describes how class imbalance learning has attracted great attention in recent years, as many real-world domain applications suffer from this problem. An imbalanced class distribution occurs when the number of training examples for one class far surpasses the training examples of the other class, often the one that is of more interest. This problem may produce a significant deterioration of classifier performance, in particular on patterns belonging to the less represented classes. To this end, the authors developed a hybrid model to address class imbalance learning with a focus on binary class problems. This model combines the benefits of ensemble classifiers with a multi-objective feature selection technique to achieve higher classification performance. The authors' model also proposes non-dominated sets of features. They then evaluate the performance of the proposed model by comparing its results with notable algorithms for solving the imbalanced data problem. Finally, the authors utilize the proposed model in the medical domain of predicting life expectancy in post-operative thoracic surgery patients.
APA, Harvard, Vancouver, ISO, and other styles
34

Thamrin, Sri Astuti, Dian Sidik, Hedi Kuswanto, Armin Lawi, and Ansariadi Ansariadi. "Exploration of Obesity Status of Indonesia Basic Health Research 2013 With Synthetic Minority Over-Sampling Techniques." Indonesian Journal of Statistics and Its Applications 5, no. 1 (March 31, 2021): 75–91. http://dx.doi.org/10.29244/ijsa.v5i1p75-91.

Full text
Abstract:
The accuracy of the data class is very important in classification with a machine learning approach. The more accurate the data sets and classes, the better the output generated by machine learning. In practice, classification can suffer from class imbalance, in which the classes do not hold equal portions of the data set. Such data imbalance affects classification accuracy. One of the easiest ways to correct imbalanced data classes is to balance them. This study aims to explore the problem of data class imbalance in a medium-sized case dataset and to address that imbalance. The Synthetic Minority Over-Sampling Technique (SMOTE) is used to overcome the problem of class imbalance in obesity status in the Indonesia 2013 Basic Health Research (RISKESDAS) data. The results show an obese class of 13.9% and a non-obese class of 84.6%, that is, a data class imbalance of moderate severity. Moreover, SMOTE with 600% over-sampling can raise the level of the minority (obesity) class, so that the obesity status classes become balanced. Therefore, the SMOTE technique performed better than no SMOTE in exploring the obesity status of the Indonesia RISKESDAS 2013 data.
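The core of SMOTE, as used in this study, is simple: each synthetic minority point is a random interpolation between a minority sample and one of its k nearest minority neighbours. A minimal sketch on hypothetical numeric feature vectors (600%-style over-sampling here just means generating several synthetic points per original one):

```python
import random

def smote(minority, k, n_new, rng):
    """Classic SMOTE for numeric features: each synthetic point lies on the
    line segment between a minority sample and one of its k nearest
    minority neighbours."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        seed = minority[i]
        neighbours = sorted((minority[j] for j in range(len(minority)) if j != i),
                            key=lambda s: dist2(seed, s))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(seed, nb)))
    return synthetic

rng = random.Random(1)
obese = [(21.5, 70.0), (23.0, 74.0), (22.1, 68.5)]  # hypothetical minority vectors
extra = smote(obese, k=2, n_new=6, rng=rng)         # six synthetic minority points
```

Because every point is an interpolation, the synthetic samples stay inside the region spanned by the existing minority samples, which is what distinguishes SMOTE from plain duplication.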
APA, Harvard, Vancouver, ISO, and other styles
35

Gupta, Reetu, Disha Gupta, and Prashant Khobragade. "A Clustering Approach for Class Imbalance Problem in Data Mining." International Journal of Computer Trends and Technology 42, no. 3 (December 25, 2016): 133–36. http://dx.doi.org/10.14445/22312803/ijctt-v42p122.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

BATUWITA, RUKSHAN, and VASILE PALADE. "ADJUSTED GEOMETRIC-MEAN: A NOVEL PERFORMANCE MEASURE FOR IMBALANCED BIOINFORMATICS DATASETS LEARNING." Journal of Bioinformatics and Computational Biology 10, no. 04 (July 23, 2012): 1250003. http://dx.doi.org/10.1142/s0219720012500035.

Full text
Abstract:
One common and challenging problem faced by many bioinformatics applications, such as promoter recognition, splice site prediction, RNA gene prediction, drug discovery and protein classification, is the imbalance of the available datasets. In most of these applications, the positive data examples are largely outnumbered by the negative data examples, which often leads to the development of sub-optimal prediction models having a high negative recognition rate (Specificity = SP) and a low positive recognition rate (Sensitivity = SE). When class imbalance learning methods are applied, the SE is usually increased at the expense of reducing some amount of the SP. In this paper, we point out that in these data-imbalanced bioinformatics applications, the goal of applying class imbalance learning methods would be to increase the SE as much as possible while keeping the reduction of SP as low as possible. We explain that the existing performance measures used in class imbalance learning can still produce sub-optimal models with respect to this classification goal. In order to overcome these problems, we introduce a new performance measure called Adjusted Geometric-mean (AGm). The experimental results obtained on ten real-world imbalanced bioinformatics datasets demonstrate that the AGm metric can achieve a lower rate of reduction of SP than the existing performance metrics when increasing the SE through class imbalance learning methods. This characteristic of the AGm metric makes it more suitable for achieving the proposed classification goal in imbalanced bioinformatics datasets learning.
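The contrast between G-mean and the proposed AGm can be shown numerically. The sketch below uses the formula AGm = (Gm + SP·Nn)/(1 + Nn), with Nn the proportion of negative (majority) examples and AGm = 0 when SE = 0, as this metric is usually stated; treat the exact form as an assumption to be checked against the paper.

```python
import math

def g_mean(se, sp):
    """Geometric mean of sensitivity (SE) and specificity (SP)."""
    return math.sqrt(se * sp)

def adjusted_g_mean(se, sp, nn):
    """Adjusted Geometric-mean: weights SP by Nn, the proportion of negative
    examples, so specificity losses are penalised more on skewed data."""
    if se == 0:
        return 0.0
    return (g_mean(se, sp) + sp * nn) / (1 + nn)

# Two models on a dataset that is 95% negative; model B trades SP for SE.
model_a = {"se": 0.60, "sp": 0.99}
model_b = {"se": 0.75, "sp": 0.90}
agm_a = adjusted_g_mean(model_a["se"], model_a["sp"], nn=0.95)
agm_b = adjusted_g_mean(model_b["se"], model_b["sp"], nn=0.95)
```

Plain G-mean prefers model B (higher SE), while AGm prefers model A: on heavily negative data, the many extra false positives behind B's lower SP matter, which is exactly the behaviour the metric is designed to capture.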
APA, Harvard, Vancouver, ISO, and other styles
37

Setiawan, Budi Darma, Uwe Serdült, and Victor Kryssanov. "A Machine Learning Framework for Balancing Training Sets of Sensor Sequential Data Streams." Sensors 21, no. 20 (October 18, 2021): 6892. http://dx.doi.org/10.3390/s21206892.

Full text
Abstract:
The recent explosive growth in the number of smart technologies relying on data collected from sensors and processed with machine learning classifiers made the training data imbalance problem more visible than ever before. Class-imbalanced sets used to train models of various events of interest are among the main reasons for a smart technology to work incorrectly or even to completely fail. This paper presents an attempt to resolve the imbalance problem in sensor sequential (time-series) data through training data augmentation. An Unrolled Generative Adversarial Networks (Unrolled GAN)-powered framework is developed and successfully used to balance the training data of smartphone accelerometer and gyroscope sensors in different contexts of road surface monitoring. Experiments with other sensor data from an open data collection are also conducted. It is demonstrated that the proposed approach allows for improving the classification performance in the case of heavily imbalanced data (the F1 score increased from 0.69 to 0.72, p<0.01, in the presented case study). However, the effect is negligible in the case of slightly imbalanced or inadequate training sets. The latter determines the limitations of this study that would be resolved in future work aimed at incorporating mechanisms for assessing the training data quality into the proposed framework and improving its computational efficiency.
APA, Harvard, Vancouver, ISO, and other styles
38

Rekha, Gillala, and V. Krishna Reddy. "A Novel Approach for Handling Outliers in Imbalanced Data." International Journal of Engineering & Technology 7, no. 3.1 (August 4, 2018): 1. http://dx.doi.org/10.14419/ijet.v7i3.1.16783.

Full text
Abstract:
Most traditional classification algorithms assume their training data to be well balanced in terms of class distribution. Real-world datasets, however, are imbalanced in nature and thus degrade the performance of traditional classifiers. To solve this problem, many strategies have been adopted to balance the class distribution at the data level. Data-level methods balance the distribution between majority and minority classes using either over-sampling or under-sampling techniques. The main concern of this paper is to remove outliers that may be generated when using over-sampling techniques. In this study, we propose a novel approach for solving the class imbalance problem at the data level, using a modified SMOTE to remove outliers that may exist after synthetic data generation with the SMOTE over-sampling technique. We extensively compare our approach with SMOTE, SMOTE+ENN, and SMOTE+Tomek-Link on nine datasets from the KEEL repository using several classification algorithms. The results reveal that our approach improves prediction performance for most of the classification algorithms and achieves better performance than the existing approaches.
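The generate-then-clean idea can be sketched as follows: run SMOTE, then drop any synthetic point whose nearest original samples are mostly majority class. This is a simple stand-in for the paper's modified SMOTE (and for SMOTE+ENN-style cleaning), not the authors' exact filter.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_point(minority, rng):
    """One SMOTE interpolation between a random seed and its nearest minority neighbour."""
    seed = rng.choice(minority)
    nb = min((s for s in minority if s is not seed), key=lambda s: dist2(seed, s))
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(seed, nb))

def filtered_smote(minority, majority, n_new, k, rng):
    """Generate SMOTE points, then discard any candidate whose k nearest
    original samples are mostly majority class (an outlier filter)."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    kept = []
    for _ in range(n_new):
        cand = smote_point(minority, rng)
        neighbours = sorted(labelled, key=lambda pl: dist2(cand, pl[0]))[:k]
        if 2 * sum(lab for _, lab in neighbours) > k:  # minority wins the vote
            kept.append(cand)
    return kept

rng = random.Random(7)
minority = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0)]   # (5, 5) is an isolated minority point
majority = [(4.8, 5.1), (5.2, 4.9), (1.0, 1.0), (1.1, 0.9)]
clean = filtered_smote(minority, majority, n_new=20, k=3, rng=rng)
```

Synthetic points generated near the isolated minority point (5, 5), which sits inside a majority region, fail the neighbourhood vote and are discarded, while points inside the dense minority cluster survive.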
APA, Harvard, Vancouver, ISO, and other styles
39

WANG, SHUO, LEANDRO L. MINKU, and XIN YAO. "ONLINE CLASS IMBALANCE LEARNING AND ITS APPLICATIONS IN FAULT DETECTION." International Journal of Computational Intelligence and Applications 12, no. 04 (December 2013): 1340001. http://dx.doi.org/10.1142/s1469026813400014.

Full text
Abstract:
Although class imbalance learning and online learning have been extensively studied in the literature separately, online class imbalance learning that considers the challenges of both fields has not drawn much attention. It deals with data streams having very skewed class distributions, such as fault diagnosis of real-time control monitoring systems and intrusion detection in computer networks. To fill in this research gap and contribute to a wide range of real-world applications, this paper first formulates online class imbalance learning problems. Based on the problem formulation, a new online learning algorithm, sampling-based online bagging (SOB), is proposed to tackle class imbalance adaptively. Then, we study how SOB and other state-of-the-art methods can benefit a class of fault detection data under various scenarios and analyze their performance in depth. Through extensive experiments, we find that SOB can balance the performance between classes very well across different data domains and produce stable G-mean when learning constantly imbalanced data streams, but it is sensitive to sudden changes in class imbalance, in which case SOB's predecessor undersampling-based online bagging (UOB) is more robust.
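In online bagging, each arriving example trains each base learner k ~ Poisson(λ) times; sampling-based variants like SOB adjust λ per class so minority examples are seen more often. The sketch below uses one simple balancing rate, λ = N / N_class, as an illustration of the idea; the paper's actual method adapts its rates online from time-decayed class sizes, so treat this rate as an assumption.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler (valid for lam > 0)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def balancing_rate(class_counts, cls):
    """Illustrative per-class rate: lambda = N / N_cls, so a class holding 10%
    of the stream is trained on roughly 10x as often per example."""
    return sum(class_counts.values()) / class_counts[cls]

rng = random.Random(3)
class_counts = {"normal": 900, "fault": 100}   # hypothetical fault-detection stream
k_major = [poisson(balancing_rate(class_counts, "normal"), rng) for _ in range(1000)]
k_minor = [poisson(balancing_rate(class_counts, "fault"), rng) for _ in range(1000)]
```

Over the stream, minority examples receive far more training repetitions in total, which is what lets the online ensemble keep its per-class performance balanced without storing the stream.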
APA, Harvard, Vancouver, ISO, and other styles
40

Agrawal, Divya, and Padma Bonde. "Improving Classification Accuracy on Imbalanced Data by Ensembling Technique." Journal of Cases on Information Technology 19, no. 1 (January 2017): 42–49. http://dx.doi.org/10.4018/jcit.2017010104.

Full text
Abstract:
Prediction using classification techniques is one of the fundamental features widely applied in various fields. Classification accuracy is still a great challenge due to the data imbalance problem. The increased volume of data also poses a challenge for data handling and prediction, particularly when technology is used as the interface between customers and the company. As data imbalance increases, it directly affects the classification accuracy of the entire system. AUC (area under the curve) and lift have proved to be good evaluation metrics. Classification techniques help to improve classification accuracy, but on an imbalanced dataset they do not predict well, and other techniques, such as over-sampling, must be resorted to. This paper presents a voting-based ensembling technique to improve classification accuracy on imbalanced data. The voting-based ensemble takes votes on the best class obtained by three classification techniques, namely logistic regression, classification trees, and discriminant analysis. The observed results reveal an improvement in classification accuracy from the voting ensembling technique.
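The combination rule itself is tiny: for each sample, take the class most often predicted by the three classifiers. A minimal sketch with hypothetical per-sample outputs (the class labels and predictions below are invented for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class labels predicted by several classifiers for one sample."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of (logistic regression, classification tree,
# discriminant analysis) for three samples:
preds = [
    ("churn", "churn", "stay"),
    ("stay", "stay", "stay"),
    ("churn", "stay", "churn"),
]
combined = [majority_vote(p) for p in preds]
```

With an odd number of classifiers over two classes, the vote is never tied; a single classifier's mistake on a sample is outvoted whenever the other two agree.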
APA, Harvard, Vancouver, ISO, and other styles
41

Xie, Yi Ning, Lian Yu, Guo Hui Guan, and Yong Jun He. "An Overlapping Cell Image Synthesis Method for Imbalance Data." Analytical Cellular Pathology 2018 (July 9, 2018): 1–12. http://dx.doi.org/10.1155/2018/7919503.

Full text
Abstract:
DNA ploidy analysis of cells is an automation technique applied in pathological diagnosis. It is important for this technique to classify various nuclei images accurately. However, the lack of overlapping nuclei images in training data (imbalanced training data) results in low recognition rates of overlapping nuclei images. To solve this problem, a new method which synthesizes overlapping nuclei images with single-nuclei images is proposed. Firstly, sample selection is employed to make the synthesized samples representative. Secondly, random functions are used to control the rotation angles of the nucleus and the distance between the centroids of the nucleus, increasing the sample diversity. Then, the Lambert-Beer law is applied to reassign the pixels of overlapping parts, thus making the synthesized samples quite close to the real ones. Finally, all synthesized samples are added to the training sets for classifier training. The experimental results show that images synthesized by this method can solve the data set imbalance problem and improve the recognition rate of DNA ploidy analysis systems.
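The pixel-reassignment step rests on the Lambert-Beer law: absorbances of stacked absorbers add, so transmitted intensities (relative to the background) multiply. A minimal sketch of that composition rule, with the background level i0 = 255 as an assumed 8-bit white:

```python
def overlap_pixel(i1, i2, i0=255.0):
    """Combine two nucleus pixels under the Lambert-Beer law: absorbances
    add, so transmitted intensities relative to the background i0 multiply."""
    return i1 * i2 / i0

# A pixel covered by two half-absorbing nuclei is darker than either alone:
single = 128.0
combined = overlap_pixel(single, single)
```

Applying this per pixel over the intersection of two single-nucleus masks makes the synthetic overlap darker in a physically plausible way, rather than simply pasting one nucleus over the other.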
APA, Harvard, Vancouver, ISO, and other styles
42

Qing, Zhipeng, Qiangyu Zeng, Hao Wang, Yin Liu, Taisong Xiong, and Shihao Zhang. "ADASYN-LOF Algorithm for Imbalanced Tornado Samples." Atmosphere 13, no. 4 (March 29, 2022): 544. http://dx.doi.org/10.3390/atmos13040544.

Full text
Abstract:
Over the past few years, early warning and forecasting of tornadoes have begun to combine artificial intelligence (AI) and machine learning (ML) algorithms to improve identification efficiency. Applying machine learning algorithms to detect tornadoes usually encounters class imbalance problems because tornadoes are rare events in weather processes. The ADASYN-LOF algorithm (ALA) is proposed to solve the imbalance problem of tornado sample sets based on radar data. The adaptive synthetic (ADASYN) sampling algorithm is used to solve the imbalance problem by increasing the number of minority-class samples, combined with the local outlier factor (LOF) algorithm to denoise the synthetic samples. The performance of the ALA algorithm is tested using support vector machine (SVM), artificial neural network (ANN), and random forest (RF) models. The results show that the ALA algorithm can improve the performance and noise immunity of the models, significantly increase the tornado recognition rate, and has the potential to increase the early tornado warning time. ALA is more effective than the ADASYN, Synthetic Minority Oversampling Technique (SMOTE), and SMOTE-LOF algorithms in preprocessing imbalanced data for SVM and ANN.
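ADASYN's distinguishing step, before any interpolation, is deciding how many synthetic samples each minority point should receive: points whose neighbourhoods contain more majority samples (harder, boundary points) get more. A minimal sketch of that allocation step only (the interpolation and the LOF denoising of the full ALA pipeline are omitted):

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def adasyn_allocation(minority, majority, k, n_total):
    """ADASYN's core step: give each minority point a share of the n_total
    synthetic samples proportional to the fraction of majority points among
    its k nearest neighbours."""
    labelled = [(q, 1) for q in minority] + [(q, 0) for q in majority]
    ratios = []
    for p in minority:
        neighbours = sorted((ql for ql in labelled if ql[0] != p),
                            key=lambda ql: dist2(p, ql[0]))[:k]
        ratios.append(sum(1 for _, lab in neighbours if lab == 0) / k)
    total = sum(ratios) or 1.0
    return [round(n_total * r / total) for r in ratios]

minority = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0)]          # (2, 2) sits among the majority
majority = [(2.1, 2.0), (1.9, 2.1), (2.0, 1.9), (5.0, 5.0), (5.1, 5.0)]
alloc = adasyn_allocation(minority, majority, k=3, n_total=10)
```

The isolated minority point surrounded by majority samples receives most of the synthetic budget, which is the adaptive behaviour that separates ADASYN from uniform SMOTE, and also why its output can be noisy enough to benefit from LOF-based cleaning.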
APA, Harvard, Vancouver, ISO, and other styles
43

Cho, Eunnuri, Tai-Woo Chang, and Gyusun Hwang. "Data Preprocessing Combination to Improve the Performance of Quality Classification in the Manufacturing Process." Electronics 11, no. 3 (February 6, 2022): 477. http://dx.doi.org/10.3390/electronics11030477.

Full text
Abstract:
The recent introduction of smart manufacturing, also called the ‘smart factory’, has made it possible to collect a significant amount of multi-variate data from Internet of Things devices and sensors. Quality control using these data in the manufacturing process can play a major role in preventing unexpected time and economic losses. However, the extraction of information about the manufacturing process is limited when the data contain missing values and the data set is imbalanced. In this study, we improve quality classification performance by solving the problems of missing values and data imbalance that can occur in the manufacturing process. The study proceeds through data cleansing, data substitution, data scaling, a data balancing methodology, and evaluation. Five data balancing methods and a generative adversarial network (GAN) were used for the data imbalance processing. The proposed schemes achieved an F1 score 0.5 higher than that of previous studies that used the same data. The data preprocessing combination proposed in this study is intended to solve the problems of missing values and imbalance that occur in the manufacturing process.
APA, Harvard, Vancouver, ISO, and other styles
44

Palli, Abdul Sattar, Jafreezal Jaafar, Heitor Murilo Gomes, Manzoor Ahmed Hashmani, and Abdul Rehman Gilal. "An Experimental Analysis of Drift Detection Methods on Multi-Class Imbalanced Data Streams." Applied Sciences 12, no. 22 (November 17, 2022): 11688. http://dx.doi.org/10.3390/app122211688.

Full text
Abstract:
The performance of machine learning models diminishes when predicting the Remaining Useful Life (RUL) of equipment or predicting faults due to the issue of concept drift. This issue is aggravated when the problem setting comprises multi-class imbalanced data. Existing drift detection methods are designed to detect certain drifts in specific scenarios. For example, a drift detector designed for binary-class data may not produce satisfactory results for applications that generate multi-class data. Similarly, a drift detection method designed to detect sudden drift may struggle with detecting incremental drift. Therefore, in this experimental investigation, we examine the performance of existing drift detection methods on multi-class imbalanced data streams with different drift types. For this purpose, the study simulated streams with various forms of concept drift and the multi-class imbalance problem to test the existing drift detection methods. The findings of the current study will aid in the selection of drift detection methods for developing solutions for real-time industrial applications that encounter similar issues. The results revealed that, among the compared methods, DDM produced the best average F1 score. The results also indicate that multi-class imbalance causes the false alarm rate to increase for most of the drift detection methods.
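DDM, the best performer in this comparison, is simple enough to sketch from its published description: it tracks the online error rate p and its standard deviation s, signalling a warning when p + s ≥ p_min + 2·s_min and a drift when p + s ≥ p_min + 3·s_min. The 30-sample warm-up below is a common convention, not something fixed by this article.

```python
class DDM:
    """Drift Detection Method (Gama et al.): monitors the error rate of an
    online classifier and flags a drift when it rises significantly above
    its historical minimum."""
    WARMUP = 30

    def __init__(self):
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the classifier misclassified this sample, else 0."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n
        s = (p * (1 - p) / self.n) ** 0.5
        if self.n < self.WARMUP:
            return "stable"
        if p + s < self.p_min + self.s_min:   # new best operating point
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            self.reset()                      # concept changed: restart stats
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"
```

Feeding the detector a stream with a stable 10% error rate followed by a burst of errors triggers a drift signal shortly after the burst begins.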
APA, Harvard, Vancouver, ISO, and other styles
45

Sun, Yue, Aidong Xu, Kai Wang, Xiufang Zhou, Haifeng Guo, and Xiaojia Han. "Intelligent Fault Diagnosis of Industrial Robot Based on Multiclass Mahalanobis-Taguchi System for Imbalanced Data." Entropy 24, no. 7 (June 24, 2022): 871. http://dx.doi.org/10.3390/e24070871.

Full text
Abstract:
One of the biggest challenges in fault diagnosis research for industrial robots is that normal data far outnumber fault data; that is, the data are imbalanced. Traditional diagnosis approaches for industrial robots are biased toward the majority categories, which decreases the diagnosis accuracy for the minority categories. To address the imbalance problem, traditional algorithms have been improved using cost-sensitive learning, single-class learning, and other approaches. However, these algorithms have their own problems, such as the difficulty of estimating the true misclassification cost, overfitting, and long computation times. Therefore, a fault diagnosis approach for industrial robots based on the Multiclass Mahalanobis-Taguchi System (MMTS) is proposed in this article. It classifies categories by measuring the degree of deviation from a sample to the reference space, which is more suitable for classifying imbalanced data. Accuracy, G-mean and F-measure are used to verify the effectiveness of the proposed approach on an industrial robot platform. The experimental results show that the proposed approach's accuracy, F-measure and G-mean improve by an average of 20.74%, 12.85% and 21.68%, respectively, compared with the other five traditional approaches when the imbalance ratio is 9. As the imbalance ratio increases, the proposed approach shows better stability than the traditional algorithms.
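The core idea, classifying by degree of deviation from a per-class reference space, can be sketched with a diagonal (per-feature variance) simplification of the Mahalanobis distance. This is an illustrative reduction, not the paper's full Mahalanobis-Taguchi construction, which uses the inverse correlation matrix and signal-to-noise-based feature weighting.

```python
def fit_reference(X):
    """Per-class reference space: feature means and variances."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    var = [sum((row[j] - mu[j]) ** 2 for row in X) / n or 1e-9
           for j in range(d)]
    return mu, var

def deviation(x, ref):
    """Variance-scaled squared deviation from the reference space
    (diagonal Mahalanobis distance, averaged over features)."""
    mu, var = ref
    return sum((xi - m) ** 2 / v for xi, m, v in zip(x, mu, var)) / len(x)

def classify(x, refs):
    """Assign the class whose reference space the sample deviates from
    least; per-class statistics make this insensitive to class sizes."""
    return min(refs, key=lambda c: deviation(x, refs[c]))
```

Because each class is modelled only by its own samples, a 9:1 imbalance ratio does not bias the decision rule toward the majority class, which is the property the abstract exploits.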
APA, Harvard, Vancouver, ISO, and other styles
46

Hemalatha, Putta, and Geetha Mary Amalanathan. "FG-SMOTE: Fuzzy-based Gaussian synthetic minority oversampling with deep belief networks classifier for skewed class distribution." International Journal of Intelligent Computing and Cybernetics 14, no. 2 (March 15, 2021): 270–87. http://dx.doi.org/10.1108/ijicc-12-2020-0202.

Full text
Abstract:
Purpose: Adequate resources for learning and training the data are an important constraint in developing an efficient classifier with outstanding performance. The data usually follow a biased distribution of classes that reflects an unequal distribution of classes within a dataset. This issue is known as the imbalance problem, one of the most common issues in real-time applications. Learning from imbalanced datasets is a ubiquitous challenge in the field of data mining. Imbalanced data degrade the performance of a classifier by producing inaccurate results. Design/methodology/approach: In the proposed work, a novel fuzzy-based Gaussian synthetic minority oversampling (FG-SMOTE) algorithm is proposed to process the imbalanced data. The mechanism of the Gaussian SMOTE technique is based on the nearest-neighbour concept to balance the ratio between the minority and majority class datasets. The ratio of the datasets belonging to the minority and majority classes is balanced using a fuzzy-based Levenshtein distance measure. Findings: The performance and accuracy of the proposed algorithm were evaluated using a deep belief network classifier, and the results showed the efficiency of the fuzzy-based Gaussian SMOTE technique: an AUC of 93.7%, an F1 score of 94.2%, and a geometric mean score of 93.6%, computed from the confusion matrix. Research limitations/implications: The proposed research still retains some challenges that need attention, such as applying FG-SMOTE to multiclass imbalanced datasets and evaluating the dataset imbalance problem in a distributed environment. Originality/value: The proposed algorithm fundamentally solves the data imbalance issues and challenges involved in handling imbalanced data. FG-SMOTE has aided in balancing the minority and majority class datasets.
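The Gaussian-SMOTE half of the method can be sketched as below; the fuzzy Levenshtein-based balancing and the deep belief network classifier are beyond a short example, and `sigma_scale` is an illustrative parameter, not a value from the paper.

```python
import random

def gaussian_smote(X_min, n_new, sigma_scale=0.3, seed=1):
    """Gaussian-SMOTE-style oversampling (simplified sketch): each synthetic
    sample is a minority point perturbed by Gaussian noise whose scale is
    tied to the distance to its nearest minority neighbour."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    synth = []
    for _ in range(n_new):
        x = rng.choice(X_min)
        nn = min((p for p in X_min if p is not x), key=lambda p: dist(x, p))
        scale = sigma_scale * dist(x, nn)   # tighter clusters, smaller noise
        synth.append(tuple(xi + rng.gauss(0, scale) for xi in x))
    return synth
```

Unlike plain SMOTE's linear interpolation, the Gaussian perturbation can place synthetic samples slightly off the segment between neighbours, which spreads the minority region more naturally.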
APA, Harvard, Vancouver, ISO, and other styles
47

Fu, Guang-Hui, Jia-Bao Wang, Min-Jie Zong, and Lun-Zhao Yi. "Feature Ranking and Screening for Class-Imbalanced Metabolomics Data Based on Rank Aggregation Coupled with Re-Balance." Metabolites 11, no. 6 (June 14, 2021): 389. http://dx.doi.org/10.3390/metabo11060389.

Full text
Abstract:
Feature screening is an important and challenging topic in current class-imbalance learning. Most existing feature screening algorithms in class-imbalance learning are based on filtering techniques. However, the variable rankings obtained by different filtering techniques generally differ, and this inconsistency among variable ranking methods is usually ignored in practice. To address this problem, we propose a simple strategy called rank aggregation with re-balance (RAR) for finding key variables in class-imbalanced data. RAR fuses the individual ranks to generate a synthetic rank that takes every ranking into account. The class-imbalanced data are modified via different re-sampling procedures, and RAR is performed in this balanced situation. Five class-imbalanced real datasets and their re-balanced counterparts are employed to test RAR's performance, and RAR is compared with several popular feature screening methods. The results show that RAR is highly competitive and outperforms single-filter screening on most assessment metrics. Performing re-balancing as a pretreatment is highly effective for rank aggregation when the data are class-imbalanced.
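The fusion step itself can be illustrated with a simple mean-rank (Borda-style) aggregation; RAR's actual fusion rule and re-balancing procedures follow the paper, so treat this purely as an illustration of "a synthetic rank that takes every ranking into account".

```python
def rank_aggregate(rankings):
    """Fuse several variable rankings (best-first lists of feature names)
    into one synthetic ranking by mean position; ties break alphabetically."""
    feats = rankings[0]
    mean_pos = {f: sum(r.index(f) for r in rankings) / len(rankings)
                for f in feats}
    return sorted(feats, key=lambda f: (mean_pos[f], f))
```

For example, three filters that disagree on the order of "b" and "c" but all rank "a" near the top yield an aggregated ranking led by "a", which is exactly the inconsistency-smoothing behaviour the abstract describes.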
APA, Harvard, Vancouver, ISO, and other styles
48

Untoro, Meida Cahyo, and Joko Lianto Buliali. "Penanganan imbalance class data laboratorium kesehatan dengan Majority Weighted Minority Oversampling Technique." Register: Jurnal Ilmiah Teknologi Sistem Informasi 4, no. 1 (November 24, 2018): 23. http://dx.doi.org/10.26594/register.v4i1.1184.

Full text
Abstract:
Diagnosis of a disease will be accurate if supported by various processes, ranging from initial checks (anamnesis) to laboratory tests. Results from the laboratory process carry information on various diseases, but some types of disease have a low prevalence. Low-prevalence diseases affect further patient treatment. With an imbalanced ratio, laboratory data will yield low accuracy in the classification and handling of disease. The Majority Weighted Minority Oversampling Technique (MWMOTE) is one way to address class imbalance. This study aims to address the imbalance problem in health laboratory data so that diseases can be classified with a higher degree of accuracy. The results of this study indicate that MWMOTE can improve accuracy on the data imbalance problem by 3.13%.
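MWMOTE's key idea, giving more weight to hard (borderline) minority samples when generating synthetics, can be sketched as follows. This is a simplified stand-in: the real MWMOTE derives information weights from identified border sets and clusters the minority class before interpolating, whereas here proximity to the nearest majority point serves as the weight.

```python
import random

def mwmote_sketch(X_min, X_maj, n_new, seed=0):
    """Weighted minority oversampling: borderline minority points (those
    closest to the majority class) are sampled more often as seeds."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # weight each minority sample by closeness to the majority class
    w = [1.0 / (min(dist(x, m) for m in X_maj) + 1e-9) for x in X_min]
    total = sum(w)
    synth = []
    for _ in range(n_new):
        r, acc, i = rng.random() * total, 0.0, 0   # weighted roulette pick
        while acc + w[i] < r:
            acc += w[i]
            i += 1
        x = X_min[i]
        z = min((p for p in X_min if p is not x), key=lambda p: dist(x, p))
        lam = rng.random()
        synth.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synth
```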
APA, Harvard, Vancouver, ISO, and other styles
49

Hussin, Sahar K., Salah M. Abdelmageid, Adel Alkhalil, Yasser M. Omar, Mahmoud I. Marie, and Rabie A. Ramadan. "Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms." Complexity 2021 (January 28, 2021): 1–15. http://dx.doi.org/10.1155/2021/6675279.

Full text
Abstract:
Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous amounts of data and suffers from drawbacks such as high dimensionality and imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for the minority dataset. For a dataset built without considering its imbalanced nature, most classification methods tend to have high predictive accuracy for the majority category, while accuracy is significantly poor for the minority category. The paper proposes a K-means algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified with high accuracy and detected with high precision. A large set of experiments was implemented on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared with both a no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.
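The K-means-then-SMOTE idea can be sketched in plain Python (the paper's Spark implementation and molecular descriptors are out of scope); `k`, the iteration count, and within-cluster pair interpolation are illustrative simplifications of KSMOTE.

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Lloyd's algorithm on tuples; returns the final clusters."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, cents[c])))
            groups[j].append(p)
        cents = [tuple(sum(v) / len(g) for v in zip(*g)) if g else cents[j]
                 for j, g in enumerate(groups)]
    return groups

def ksmote(X_min, n_new, k=2, seed=0):
    """KSMOTE-style oversampling: cluster the minority class first, then
    create synthetics by interpolating pairs inside the same cluster, so
    synthetic points never bridge two separate minority regions."""
    rng = random.Random(seed)
    groups = [g for g in kmeans(X_min, k, seed=seed) if len(g) >= 2]
    synth = []
    for _ in range(n_new):
        a, b = rng.sample(rng.choice(groups), 2)
        lam = rng.random()
        synth.append(tuple(x + lam * (y - x) for x, y in zip(a, b)))
    return synth
```

Clustering before oversampling is what distinguishes this family of methods from plain SMOTE, which may interpolate across disjoint minority clusters and generate noisy samples in majority territory.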
APA, Harvard, Vancouver, ISO, and other styles
50

Yu, Hualong, Shufang Hong, Xibei Yang, Jun Ni, Yuanyuan Dan, and Bin Qin. "Recognition of Multiple Imbalanced Cancer Types Based on DNA Microarray Data Using Ensemble Classifiers." BioMed Research International 2013 (2013): 1–13. http://dx.doi.org/10.1155/2013/239628.

Full text
Abstract:
DNA microarray technology can measure the activities of tens of thousands of genes simultaneously, which provides an efficient way to diagnose cancer at the molecular level. Although this strategy has attracted significant research attention, most studies neglect an important problem: most DNA microarray datasets are skewed, which causes traditional learning algorithms to produce inaccurate results. Some studies have considered this problem, yet they focus merely on the binary-class problem. In this paper, we deal with the multiclass imbalanced classification problem, as encountered in cancer DNA microarrays, by using ensemble learning. We utilize a one-against-all coding strategy to transform the multiclass problem into multiple binary-class problems, each applying feature subspace, an evolved version of random subspace that generates multiple diverse training subsets. Next, we introduce one of two different correction technologies, namely decision threshold adjustment or random undersampling, into each training subset to alleviate the damage of class imbalance. Specifically, a support vector machine is used as the base classifier, and a novel voting rule called counter voting is presented for making the final decision. Experimental results on eight skewed multiclass cancer microarray datasets indicate that, unlike many traditional classification approaches, our methods are insensitive to class imbalance.
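The one-against-all decomposition with per-task undersampling can be sketched as follows; a nearest-centroid rule stands in for the paper's SVM base learner, and the margin-based vote is a simple stand-in for its counter-voting rule.

```python
import random

def centroid(X):
    return tuple(sum(v) / len(X) for v in zip(*X))

def ova_ensemble(X, y, seed=0):
    """One-against-all with random undersampling: each binary task shrinks
    the 'rest' class down to the size of the target class, so no single
    task is dominated by the majority."""
    rng = random.Random(seed)
    models = {}
    for c in set(y):
        pos = [x for x, lbl in zip(X, y) if lbl == c]
        rest = [x for x, lbl in zip(X, y) if lbl != c]
        rest = rng.sample(rest, min(len(rest), len(pos)))  # undersample
        models[c] = (centroid(pos), centroid(rest))
    return models

def predict(models, x):
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    # vote for the class whose 'positive' centroid beats its undersampled
    # 'rest' centroid by the largest margin
    return max(models,
               key=lambda c: sqdist(x, models[c][1]) - sqdist(x, models[c][0]))
```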
APA, Harvard, Vancouver, ISO, and other styles