Dissertations on the topic "Data imbalance problem"
Browse the top 17 dissertations for research on the topic "Data imbalance problem".
Gao, Jie. "Data Augmentation in Solving Data Imbalance Problems." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-289208.
This project focuses mainly on methods for solving data imbalance problems in the field of Natural Language Processing (NLP). Imbalanced text data is a common problem in many tasks, especially classification, where it leaves the model unable to predict the minority class. Sometimes even switching to a more sophisticated and complex model does not improve performance, while simple data strategies that target the imbalance, such as oversampling or undersampling, have a positive effect on the result. Common data strategies include resampling methods that either duplicate samples from the original data or remove original samples to achieve balance. Previous work has also used methods such as word replacement, word swapping, and word deletion. At the same time, deep learning models such as BERT, GPT, and fastText have a strong capacity for general natural language understanding, so we select some of them to address the data imbalance problem. However, there has been no systematic comparison of these methods in practice. For example, oversampling and undersampling are fast and easy to apply to the small-scale datasets used in earlier work, whereas on larger datasets the new data generated by some deep network models is more compatible with the original data. Our work therefore asks how different data augmentation techniques perform when used to solve data imbalance problems, given the dataset and the task. Both qualitative and quantitative experimental results show that different methods have their advantages for different datasets. In general, data augmentation can improve the performance of classification models. Specifically, BERT, and especially our fine-tuned BERT, performs very well in most scenarios (across different scales and types of dataset). Nevertheless, other techniques such as back-translation perform better on long text data, even though they take more time and involve a complicated model. In conclusion, a suitable choice of data augmentation method can help solve data imbalance problems.
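The word-level strategies mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the function names, the whitespace tokenisation, and the deletion probability of 0.2 are all assumptions chosen for illustration.

```python
import random

def random_swap(words, n_swaps, rng):
    """Return a copy of `words` with `n_swaps` random pairs of positions exchanged."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """Drop each word with probability `p`, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def augment_minority(texts, labels, minority_label, copies, seed=0):
    """Append `copies` perturbed variants of every minority-class text.

    Variants alternate between one random swap and random deletion (p=0.2,
    an assumed value). Original texts are kept unchanged at the front.
    """
    rng = random.Random(seed)
    new_texts, new_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        if label != minority_label:
            continue
        words = text.split()  # naive whitespace tokenisation (assumption)
        for k in range(copies):
            if k % 2 == 0:
                variant = random_swap(words, 1, rng)
            else:
                variant = random_deletion(words, 0.2, rng)
            new_texts.append(" ".join(variant))
            new_labels.append(label)
    return new_texts, new_labels
```

Calling `augment_minority(texts, labels, "pos", copies=3)` on a corpus with one `"pos"` text would add three perturbed copies of it. Model-based augmentation (BERT, back-translation) needs a trained model and is not sketched here.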
Barella, Victor Hugo. "Técnicas para o problema de dados desbalanceados em classificação hierárquica." Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-06012016-145045/.
Recent advances in science and technology have increased both the quantity and the availability of data. Along with this explosion of generated information, there is a need to analyze data to discover new and useful knowledge. Areas concerned with extracting knowledge and useful information from large datasets, such as Machine Learning (ML) and Data Mining (DM), have therefore become great opportunities for research. However, some limitations may reduce the accuracy of traditional algorithms in these areas, for example an imbalance in the number of samples per class in a dataset. To mitigate this drawback, several solutions have been the target of research in recent years, such as techniques for artificially balancing data, algorithm modifications, and new approaches designed for imbalanced data. One area little explored from the data imbalance perspective is hierarchical classification, in which the classes are organized into hierarchies, commonly in the form of a tree or a DAG (Directed Acyclic Graph). This work investigates the limitations of, and approaches to minimizing, the effects of imbalanced data in hierarchical classification problems. The experimental results show the need to take the characteristics of hierarchical classes into account when deciding which techniques for imbalanced data to apply in hierarchical classification.
Gao, Ming. "A study on imbalanced data classification problems." Thesis, University of Reading, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.602707.
Jeatrakul, Piyasak. "Enhancing classification performance over noise and imbalanced data problems." PhD thesis, Murdoch University, 2012. https://researchrepository.murdoch.edu.au/id/eprint/10044/.
Pan, Yi-Ying (潘怡瑩). "Clustering-based Data Preprocessing Approach for the Class Imbalance Problem." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/94nys8.
國立中央大學
資訊管理學系
106
The class imbalance problem is an important issue in data mining. It occurs when the number of samples in one class is much larger than in the other classes. Traditional classifiers maximize overall accuracy and therefore tend to misclassify most minority-class samples as belonging to the majority class, which makes it hard to establish a good classification rule for the minority class. The class imbalance problem arises in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition. To deal with it, a clustering-based data preprocessing approach is proposed in which two clustering techniques, affinity propagation and K-means, are each used to divide the majority class into several subclasses, turning the task into a multiclass problem. This approach can effectively reduce the class imbalance ratio of the training dataset, shorten classifier training time, and improve classification performance. Our experiments use forty-four small class imbalance datasets from KEEL and eight high-dimensional datasets from NASA to build five types of classification models: C4.5, MLP, Naïve Bayes, SVM, and k-NN (k=5). We also employ classifier ensemble algorithms. The study compares AUC results across clustering techniques, classification models, and the number of clusters used in K-means in order to find the best configuration of the proposed approach and to compare it with methods from the literature. The experimental results on the KEEL datasets show that k-NN (k=5) is the best classifier choice with either affinity propagation or K-means (K=5); the results on the NASA datasets show that the proposed approach outperforms the literature methods on high-dimensional data.
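The core preprocessing idea, splitting the majority class into cluster-based subclasses, can be sketched as follows. This is an illustrative sketch rather than the thesis's implementation: it uses a small hand-written K-means with a deterministic farthest-first initialisation (an assumption made so the example is reproducible), and the helper names and the `label_cluster` naming scheme are hypothetical.

```python
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Pick k initial centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (deterministic, an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    centers = farthest_first(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def split_majority(X, y, majority_label, k):
    """Relabel majority samples as '<label>_<cluster>'; other labels are untouched."""
    idx = [i for i, label in enumerate(y) if label == majority_label]
    assign = kmeans([X[i] for i in idx], k)
    new_y = list(y)
    for i, cluster in zip(idx, assign):
        new_y[i] = f"{majority_label}_{cluster}"
    return new_y

def imbalance_ratio(y):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())
```

A multiclass classifier would then be trained on the relabelled data. With eight majority samples in two well-separated groups and two minority samples, `split_majority(X, y, "neg", 2)` reduces the imbalance ratio from 4 to 2.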
Komba, Lyee. "Sampling Techniques for Class Imbalance Problem in Aviation Safety Incidents Data." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/jg2y52.
國立臺北科技大學
電資國際專班
106
Like any other industry, the aviation industry acquires a variety of data every day through numerous data management systems. Structured and unstructured data are collected through aircraft systems, maintenance systems, supply systems, ticketing and booking systems, and many other systems used in the daily operations of the aviation business. Data mining can be used to analyze all these types of data and generate meaningful information that can improve future performance, safety, and profitability for aviation business and operations. This thesis presents data mining methods that use aviation incident data to predict incidents with fatal consequences. Other studies applying data mining within the aviation industry cover prediction of passenger travel, meteorological prediction, component failure prediction, and fatal incident prediction aimed at finding the right features. This study uses the public dataset from the Federal Aviation Administration Accident and Incident Data System (FAA AIDS) website, covering records from 2000 to 2017. Our goal is to build a prediction model for fatal incidents and to generate decision rules or factors that contribute to incidents with fatal results. The resulting model would serve as a predictive risk management system for aviation safety. The aviation industry generally operates safely because it has moved from reactive safety and risk management to a proactive safety management approach, and now to a predictive approach that applies data mining techniques such as those in this study. Over time, the number of systems has increased and the number of aviation accidents and serious incidents has decreased. Accordingly, our analysis found that only 0.6% of incidents had fatal consequences.
During the data preprocessing stage we encountered the problem of an unbalanced dataset, which led us to propose techniques to address it. Unbalanced datasets are datasets in which the minority classes are represented by far fewer records than the majority class, which matters most when the analysis targets the minority class. Failing to deal with this issue correctly may result in poorly performing models or misclassified data. With the travelling population in the aviation community growing, safety is paramount, so building a reasonably precise model is important. To build a precise model or classifier, we need to preprocess and resample the data effectively. This thesis therefore also looks at how to turn the unbalanced data into a balanced dataset that can be used to train a classifier. We applied the following sampling techniques in RStudio to address the imbalance: oversampling, under-sampling, SMOTE, and bootstrap sampling. The datasets produced by these techniques are used to train different classifiers, whose performance is measured and discussed in this thesis.
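Two of the sampling techniques named in this abstract, random oversampling and random under-sampling, can be sketched in a few lines. The thesis reports using R; the Python below is only an illustration with hypothetical function names, and it does not cover SMOTE or bootstrap sampling.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels
```

For a dataset matching the 0.6% fatal rate reported above (6 fatal records in 1,000), oversampling yields 994 records per class, while under-sampling keeps only 6 per class and discards most of the non-fatal records. That trade-off is one reason to compare several techniques.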
Yao, Guan-Ting (姚冠廷). "A Two-Stage Hybrid Data Preprocessing Approach for the Class Imbalance Problem." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/dm48kk.
國立中央大學
資訊管理學系
105
The class imbalance problem is an important issue in data mining. A skewed class distribution occurs when the number of examples representing one class is much lower than that of the other classes. Because traditional classifiers maximize overall accuracy, they tend to misclassify most minority-class samples as belonging to the majority class. This limits the construction of effective classifiers for the valuable minority class. The problem occurs in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition. To deal with the class imbalance problem, I propose a two-stage hybrid data preprocessing framework based on clustering and instance selection techniques. This approach filters out noisy data in the majority class and can reduce the execution time of classifier training. More importantly, it reduces the effect of class imbalance and performs very well in the classification task. Our experiments use 44 class imbalance datasets from KEEL to build four types of classification models: C4.5, k-NN, Naïve Bayes, and MLP. Classifier ensemble algorithms are also employed. In addition, two clustering techniques and three instance selection algorithms are used in order to find the combination best suited to the proposed method. The experimental results show that the proposed framework outperforms many well-known state-of-the-art approaches in terms of AUC. In particular, the proposed framework combined with bagging-based MLP ensemble classifiers performs best, achieving an AUC of 92%.
吳思翰. "Combine Particle Swarm Optimization and Mahalonobis-Taguchi System for Solving Classification Problem in Imbalance Data." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/06887158161687794935.
Chang, Yu-shan (張毓珊). "Developing Data Mining Models for Class Imbalance Problems." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/57781951199735409394.
朝陽科技大學
資訊管理系碩士班
98
In classification problems, class imbalance biases the training of classifiers and results in low predictive accuracy on minority class examples. The problem is caused by imbalanced data in which almost all examples belong to one class and far fewer belong to the others. Compared with the majority examples, the minority examples usually form the class of greater interest, such as rare diseases in medical diagnosis data, failures in inspection data, and fraud in credit screening data. When inducing knowledge from an imbalanced data set, traditional data mining algorithms achieve high classification accuracy for the majority class but an unacceptable error rate for the minority class. They are therefore not suitable for handling class-imbalanced data. To tackle the class imbalance problem, this study aims to (1) find a robust classifier among candidates including Decision Tree (DT), Logistic Regression (LR), Mahalanobis Distance (MD), and Support Vector Machines (SVM); and (2) propose two novel methods called MD-SVM (a new two-phase learning scheme) and SWAI (SOM Weights As Input). Experimental results indicate that the proposed MD-SVM and SWAI identify minority class examples better than traditional techniques such as under-sampling, cost adjusting, and cluster-based sampling.
Liu, Yi-Hsun (劉奕勛). "Deep Discriminative Features Learning and Sampling for Imbalanced Data Problem." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/3cc7k8.
國立交通大學
資訊科學與工程研究所
106
The imbalanced data problem occurs in many application domains and is considered a challenging problem in machine learning and data mining. Oversampling may lead to overfitting, while undersampling may discard representative data samples. Additionally, most resampling methods that generate synthetic data focus on the minority class without considering the data distribution of the majority classes. This paper presents an algorithm that combines feature embedding with loss functions from discriminative feature learning in deep learning to generate synthetic data samples. In contrast to previous works, the proposed method considers both majority and minority classes when learning feature embeddings and uses appropriate loss functions to make the embeddings as discriminative as possible. The proposed method is a comprehensive framework, and different feature extractors can be used for different domains. We conduct experiments on eight numerical datasets and one image dataset, all based on multiclass classification tasks. The experimental results indicate that the proposed method provides accurate and stable results. We also investigate the proposed method thoroughly and use a visualization technique to determine why it can generate good data samples.
Cieslak, David A. "Finding problems in, proposing solutions to, and performing analysis on imbalanced data." 2009. http://etd.nd.edu/ETD-db/theses/available/etd-07082009-100035/.
Thesis directed by Nitesh Chawla for the Department of Computer Science and Engineering. "July 2009." Includes bibliographical references (leaves 230-241).
"BagStack Classification for Data Imbalance Problems with Application to Defect Detection and Labeling in Semiconductor Units." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.53957.
Dissertation/Thesis
Doctoral Dissertation Computer Engineering 2019
Esteves, Vitor Miguel Saraiva. "Techniques to deal with imbalanced data in multi-class problems: A review of existing methods." Master's thesis, 2020. https://hdl.handle.net/10216/126820.
Su, Po-Yu (蘇柏瑜). "Integrating Clustering Analysis with Granular Computing for Imbalanced Data Classification Problem─A Case Study on Prostate Cancer Prognosis." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/ps3nj5.
國立臺灣科技大學
工業管理系
103
This study aims to deal with the class imbalance problem by using the concept of Information Granulation (IG). Majority-class data are assembled into granules to balance the ratio of classes within the data. This process reduces the risk of critical information being diluted by large amounts of relatively unimportant data and noise. Three clustering techniques, dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm K-means (GA K-means), and artificial bee colony K-means (ABC K-means), are implemented to construct information granules, giving three granular computing (GrC) models for solving the class imbalance problem. At the end of the procedure, classifiers are applied to construct a classification model for each dataset. The effectiveness of the proposed GrC models has been evaluated on benchmark data sets from the UCI Machine Learning Repository. Since the proposed models produce solid classification results, real-world data on the survival length of patients with prostate cancer were used to construct a prognosis system, and the classification results there are also very promising. The results indicate that the proposed GrC models reduce the difficulty of classifying imbalanced data. Furthermore, they raise the accuracy on minority classes and, in most cases, the overall accuracy. The computational results for prostate cancer prognosis give doctors better information and analysis of patients' survival prospects.
Soares, Jastin Pompeu. "Explorar diferentes estratégias de data mining aplicadas a dois problemas no pré-processamento de dados." Master's thesis, 2017. http://hdl.handle.net/10316/83131.
With increasing volumes of data, technological improvements, and the need to extract knowledge from data, Machine Learning techniques have been studied extensively, with the main contributions currently focused on developing and improving algorithms. In this context, data quality is crucial to achieving good results. As part of data analysis, preprocessing is one of the stages of knowledge discovery in databases that enables data quality to be improved. This dissertation aims to contribute to two problems that may arise in the preprocessing stage: Missing Data and Imbalanced Data.
To solve the first problem, researchers typically use brute-force strategies that, in addition to their high computational cost, do not take the nature of the data into account and therefore do not generalize to different contexts. This work explores the relationship between the performance of state-of-the-art imputation techniques and the data distribution, with the aim of developing a heuristic that chooses the most appropriate imputation technique for each feature included in the study and so avoids the need to test several techniques. The results show that there is a relationship between the features' distributions and imputation performance. This performance seems to be influenced by the strategy and rate used to generate the missing data.
In the second problem, the intention is to measure the performance of classifiers in imbalanced data contexts. The approach used to perform cross-validation (before or after preprocessing) can lead to over-optimistic performance estimates when oversampling techniques are applied to attenuate the between-class imbalance. This work aims to show which cross-validation approach is most correct and to relate the over-optimistic performance to the complexity of the datasets. The results show that the most appropriate cross-validation approach is the one in which the dataset is split before the preprocessing stage, and that over-optimistic performance seems to be related to the similarity in complexity of the training and test sets.
Lin, Li (林立). "Integration of Particle Swarm K-means Optimization Algorithm and Granular Computing for Imbalanced Data Classification Problem - A Case Study on Prostate Cancer Prognosis." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/58965099342188347198.
國立臺灣科技大學
工業管理系
101
In Taiwan, prostate cancer ranks fifth among cancers in men by morbidity and seventh by mortality, and the number of men suffering from prostate cancer has been increasing every year. Currently, the prognosis of prostate cancer is judged according to the five-year survival rate, and estimating the life expectancy of prostate cancer patients has become a critical issue. However, pathological data usually have a skewed distribution, which easily leads to errors in pathological judgment. To reduce these errors, this study focuses on the problem of class imbalance. It proposes a PSKO-based granular computing (GrC) model to preprocess the skewed class distribution. The GrC model acquires knowledge from information granules rather than from numerical data, and it processes multidimensional and sparse data by using Singular Value Decomposition and Latent Semantic Indexing (LSI). High-dimensional and sparse data can be preprocessed with LSI to reduce both the data dimension and the number of records. The proposed method was applied to ten data sets from the UCI Machine Learning Repository to demonstrate its effectiveness. Experimental results indicate that the proposed information granule model performs promisingly when classifying both imbalanced and balanced data. The PSKO-based GrC model not only obtains better information granules but also improves the ability to classify imbalanced data. It can also support physicians in judging the pathological condition and survival rate of prostate cancer patients.