Dissertations / Theses on the topic 'Data imbalance problem'

To see the other types of publications on this topic, follow the link: Data imbalance problem.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 17 dissertations / theses for your research on the topic 'Data imbalance problem.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Gao, Jie. "Data Augmentation in Solving Data Imbalance Problems." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-289208.

Full text
Abstract:
This project focuses on methods for solving data imbalance problems in the Natural Language Processing (NLP) field. Imbalanced text data is a common problem in many tasks, especially classification, and causes models to predict the minority class poorly. Sometimes even switching to a more powerful and complicated model does not improve performance, while simple data strategies aimed at the imbalance, such as over-sampling or down-sampling, have a positive effect on the result. Common data strategies include re-sampling methods that duplicate data from the original set or remove some of it to restore balance. Beyond that, methods such as word replacement, word swap, and word deletion have been used in previous work. At the same time, deep learning models such as BERT, GPT, and fastText have a strong capacity for general natural language understanding, so some of them are chosen here to address the data imbalance problem. However, there is no systematic comparison of these methods in practice. For example, over-sampling and down-sampling are fast and easy to use on small datasets, while for larger datasets the data newly generated by deep network models is more compatible with the original data. Our work therefore asks: how do various data augmentation techniques perform when used to solve data imbalance problems, given the dataset and task? Both qualitative and quantitative experimental results demonstrate that different methods have their own advantages for different datasets. In general, data augmentation can improve the performance of classification models. Specifically, BERT, and especially our fine-tuned BERT, performs excellently in most scenarios (across different scales and types of dataset), while other techniques such as back-translation perform better on long text data, even though they cost more time and require a more complicated model. In conclusion, a suitable choice of data augmentation method can help to solve data imbalance problems.
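As a quick illustration of the two baseline re-sampling strategies this abstract contrasts with model-based augmentation, here is a minimal Python sketch using scikit-learn and imbalanced-learn; the toy documents, labels, and vectorizer settings are placeholders rather than anything from the thesis.

```python
# Minimal sketch of random over-sampling and random under-sampling of an
# imbalanced text dataset. The toy documents/labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

docs = ["good service", "bad service", "great product", "awful product",
        "fine", "okay", "nice", "terrible experience"]
labels = [1, 0, 1, 0, 1, 1, 1, 0]          # class 0 is the minority here

X = TfidfVectorizer().fit_transform(docs)

# Duplicate minority documents until the classes are balanced ...
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, labels)
# ... or drop majority documents instead.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, labels)

clf = LogisticRegression().fit(X_over, y_over)
print(sorted(y_over), sorted(y_under))
```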
APA, Harvard, Vancouver, ISO, and other styles
2

Barella, Victor Hugo. "Técnicas para o problema de dados desbalanceados em classificação hierárquica." Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-06012016-145045/.

Full text
Abstract:
Recent advances in science and technology have made possible the growth of data in quantity and availability. Along with this explosion of generated information comes the need to analyze data to discover new and useful knowledge. Thus, areas devoted to extracting knowledge and useful information from large datasets, such as Machine Learning (ML) and Data Mining (DM), have become great opportunities for the advancement of research. However, some limitations may reduce the accuracy of traditional algorithms in these areas, for example the imbalance of class samples in a dataset. To mitigate this drawback, several solutions have been the target of research in recent years, such as the development of techniques for artificially balancing data, algorithm modification, and new approaches for imbalanced data. An area little explored from the data-imbalance perspective is hierarchical classification, in which the classes are organized into hierarchies, commonly in the form of a tree or a DAG (Directed Acyclic Graph). The goal of this work was to investigate the limitations of, and approaches to minimizing, the effects of imbalanced data in hierarchical classification problems. The experimental results show the need to take the features of hierarchical classes into account when deciding whether to apply techniques for imbalanced data in hierarchical classification.
APA, Harvard, Vancouver, ISO, and other styles
3

Gao, Ming. "A study on imbalanced data classification problems." Thesis, University of Reading, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.602707.

Full text
Abstract:
This thesis focuses on the study of machine learning and pattern recognition algorithms for imbalanced data problems. Imbalanced problems are important because they are prevalent in life-threatening and safety-critical applications, and they are known to be problematic for standard machine learning algorithms due to the imbalanced distribution between positive and negative classes. My original contribution to knowledge in this field is fourfold. First, a powerful and efficient algorithm for solving two-class imbalanced problems is proposed. The proposed method combines the synthetic minority over-sampling technique with a radial basis function classifier optimised by particle swarm optimisation to enhance the classifier's performance for imbalanced learning. Second, an over-sampling technique for imbalanced problems, probability density function estimation based over-sampling, is proposed. In contrast to existing over-sampling techniques that lack sufficient theoretical insight and justification, the synthetic data samples are generated from a probability density function estimated from the positive data via the Parzen window. Third, a unified neurofuzzy modelling scheme is proposed: a novel initial rule construction method on the subspaces of the input features is formed, supervised subspace orthogonal least squares learning is applied for model construction, and a logistic regression model presents the classifier's output. Fourth, based on the unified neurofuzzy model, a new class of neurofuzzy construction algorithms is proposed with the aim of maximising generalisation capability specifically for imbalanced data classification, based on leave-one-out cross-validation.
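The second contribution described above, probability density function estimation based over-sampling, can be sketched with off-the-shelf tools: fit a Parzen-window (kernel density) estimate on the minority class and draw synthetic samples from it. This is only an illustrative approximation; the dataset, Gaussian kernel, and bandwidth below are assumptions, not the thesis's actual configuration.

```python
# Hedged sketch of PDF-estimation-based over-sampling: estimate the minority
# class density with a Parzen window and sample synthetic minority points.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KernelDensity

X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)
X_min = X[y == 1]                                   # minority samples

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_min)
n_needed = (y == 0).sum() - (y == 1).sum()          # how many to synthesize
X_synth = kde.sample(n_samples=n_needed, random_state=0)

X_bal = np.vstack([X, X_synth])
y_bal = np.concatenate([y, np.ones(n_needed, dtype=int)])
print(np.bincount(y_bal))                           # classes are now balanced
```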
APA, Harvard, Vancouver, ISO, and other styles
4

Jeatrakul, Piyasak. "Enhancing classification performance over noise and imbalanced data problems." Thesis, Jeatrakul, Piyasak (2012) Enhancing classification performance over noise and imbalanced data problems. PhD thesis, Murdoch University, 2012. https://researchrepository.murdoch.edu.au/id/eprint/10044/.

Full text
Abstract:
This research presents the development of techniques to handle two issues in data classification: noise and imbalanced data. Noise is a significant problem that can degrade the quality of the training data of any learning algorithm. Learning algorithms trained on noisy instances generally make more misclassifications, so classification performance tends to decrease. Meanwhile, the imbalanced data problem also affects the performance of learning algorithms: if some classes have a much larger number of instances than the others, the learning algorithms tend to be dominated by the features of the majority classes, and the features of the minority classes are difficult to recognise. As a result, the classification performance on the minority classes can be significantly lower than on the majority classes. It is therefore important to implement techniques that better handle the negative effects of noise and imbalanced data. Although several approaches attempt to handle these problems, the available approaches still have shortcomings. Among the noise handling techniques, the noise-tolerant approach does not require any data preprocessing but can tolerate only a certain amount of noise; a classifier developed from noisy data tends to be less predictive if the training data contains a large number of noisy instances. The noise elimination approach, although easily applied to various problem domains, can degrade the quality of the training data if it cannot distinguish between noise and rare cases (exceptions). For the imbalanced data problem, the available techniques also present limitations. The algorithm-level approach performs effectively only on specific problem domains or specific learning algorithms, while the data-level approach can either eliminate necessary information from the training set or over-fit the minority class. Moreover, when the imbalanced data problem becomes more complex, as in multi-class classification, it is difficult to apply the re-sampling techniques (the data-level approach) that perform effectively for imbalanced binary classification. These limitations motivate this research to propose and investigate techniques that handle noise and imbalanced data problems more effectively. This thesis develops three new techniques to overcome the identified problems. Firstly, a cleaning technique called the Complementary Neural Network (CMTNN) data cleaning technique has been developed in order to remove noise (misclassified data) from the training set. The results show that the new noise detection and removal technique can eliminate noise with confidence. Furthermore, the CMTNN cleaning technique can increase classification accuracy across different learning algorithms, namely Artificial Neural Network (ANN), Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and Decision Tree (DT), and it provides higher classification performance than other cleaning methods such as Tomek links, majority voting filtering, and consensus voting filtering. Secondly, the CMTNN re-sampling technique, a new under-sampling technique, has been developed to handle the imbalanced data problem in binary classification.
The results show that the combination of the CMTNN re-sampling technique and the Synthetic Minority Over-sampling Technique (SMOTE) performs effectively, improving the classification performance of the minority class in terms of the Geometric Mean (G-Mean) and the area under the Receiver Operating Characteristic (ROC) curve. It generally provides higher performance than other re-sampling techniques such as Tomek links, Wilson's Edited Nearest Neighbor Rule (ENN), SMOTE, the combination of SMOTE and ENN, and the combination of SMOTE and Tomek links. As the third proposed technique, an algorithm named One-Against-All with Data Balancing (OAA-DB) has been developed to deal with the imbalanced data problem in multi-class classification. This algorithm not only improves the performance for the minority class but also maintains the overall accuracy, which is normally reduced by other techniques. The OAA-DB algorithm increases performance in terms of classification accuracy and F-measure when compared to other multi-class classification approaches, including One-Against-All (OAA), One-Against-One (OAO), All and One (A&O), and One-Against-Higher-Order (OAHO). Furthermore, this algorithm shows that re-sampling is effective not only for the class imbalance problem in binary classification but also for the imbalanced data problem in multi-class classification.
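A rough sketch of the clean-then-oversample pattern the abstract describes follows. CMTNN itself is not available in common libraries, so Wilson's ENN (one of the baselines mentioned above) stands in for the cleaning step before SMOTE, and the evaluation uses G-mean and ROC AUC as in the thesis; all other details are illustrative.

```python
# Clean the training data with ENN (stand-in for CMTNN), then over-sample
# the minority class with SMOTE, and report G-mean and ROC AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.over_sampling import SMOTE
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_clean, y_clean = EditedNearestNeighbours().fit_resample(X_tr, y_tr)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_clean, y_clean)

clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
proba = clf.predict_proba(X_te)[:, 1]
print("G-mean:", geometric_mean_score(y_te, clf.predict(X_te)))
print("ROC AUC:", roc_auc_score(y_te, proba))
```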
APA, Harvard, Vancouver, ISO, and other styles
5

Pan, Yi-Ying, and 潘怡瑩. "Clustering-based Data Preprocessing Approach for the Class Imbalance Problem." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/94nys8.

Full text
Abstract:
Master's thesis
National Central University
Department of Information Management
106
The class imbalance problem is an important issue in data mining. It occurs when the number of samples in one class is much larger than the other classes. Traditional classifiers tend to misclassify most samples of the minority class into the majority class for maximizing the overall accuracy. This phenomenon makes it hard to establish a good classification rule for the minority class. The class imbalance problem often occurs in many real world applications, such as fault diagnosis, medical diagnosis and face recognition. To deal with the class imbalance problem, a clustering-based data preprocessing approach is proposed, where two different clustering techniques including affinity propagation clustering and K-means clustering are used individually to divide the majority class into several subclasses resulting in multiclass data. This approach can effectively reduce the class imbalance ratio of the training dataset, shorten the class training time and improve classification performance. Our experiments based on forty-four small class imbalance datasets from KEEL and eight high-dimensional datasets from NASA to build five types of classification models, which are C4.5, MLP, Naïve Bayes, SVM and k-NN (k=5). In addition, we also employ the classifier ensemble algorithm. This research tries to compare AUC results between different clustering techniques, different classification models and the number of clusters of K-means clustering in order to find out the best configuration of the proposed approach and compare with other literature methods. Finally, the experimental results of the KEEL datasets show that k-NN (k=5) algorithm is the best choice regardless of whether affinity propagation or K-means (K=5); the experimental results of NASA datasets show that the performance of the proposed approach is superior to the literature methods for the high-dimensional datasets.
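A minimal sketch of the clustering-based preprocessing idea follows: the majority class is split into sub-classes with K-means, a classifier is trained on the resulting multiclass labels, and predicted sub-classes are mapped back to the original majority label. The synthetic data, K = 5, and k-NN (k = 5) below are illustrative choices that merely echo the configuration reported above.

```python
# Split the majority class into K sub-classes, train on the multiclass
# labels, then collapse sub-class predictions back to the majority label.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

maj = y_tr == 0
sub = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_tr[maj])

y_multi = y_tr.copy()
y_multi[maj] = 2 + sub                # sub-classes 2..6 replace the majority 0

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_multi)
pred = clf.predict(X_te)
pred = np.where(pred >= 2, 0, pred)   # collapse sub-classes back to majority
print((pred == y_te).mean())
```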
APA, Harvard, Vancouver, ISO, and other styles
6

Komba, Lyee. "Sampling Techniques for Class Imbalance Problem in Aviation Safety Incidents Data." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/jg2y52.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
International Graduate Program in Electrical Engineering and Computer Science
106
Like many other industries in the world, the aviation industry acquires a variety of data every day through numerous data management systems. Structured and unstructured data are collected through aircraft systems, maintenance systems, supply systems, ticketing and booking systems, and many other systems used in the daily operations of the aviation business. Data mining can be used to analyze all these different types of data to generate meaningful information that can improve future performance, safety, and profitability for aviation business and operations. This thesis presents data mining methods, based on aviation incident data, for predicting incidents with fatal consequences. Other studies have applied data mining techniques within the aviation industry to the prediction of passenger travel, meteorological prediction, component failure prediction, and fatal incident prediction aimed at finding the right features. This study uses the public dataset from the Federal Aviation Administration Accident and Incident Data System (FAA AIDS) website, covering records from the year 2000 to 2017. Our goal is to build a prediction model for fatal incidents and to generate decision rules, or factors, contributing to incidents with fatal results. In this way, the model becomes a predictive risk management system for aviation safety. The aviation industry generally operates in a safe state because of the transition from reactive safety and risk management to a proactive safety management approach, and now to a predictive approach enabled by data mining techniques such as those in this study. Over time, the number of systems has increased and the number of aviation accidents and serious incidents has decreased; only 0.6% of the incidents in our analysis had fatal consequences. During the data preprocessing stage, the problem of an unbalanced dataset is encountered, which prompts us to propose techniques to solve it. Unbalanced datasets are datasets in which far fewer records represent the minority class than the majority class, which is problematic especially when the analysis is aimed at the minority class. Not dealing with this issue correctly may result in poorly performing models or misclassified data. With the growth of the travelling population in the aviation community, safety is paramount, so a relatively precise model is important. In order to build such a model, we need to preprocess and resample the data efficiently. This thesis therefore also looks at combating the issue of unbalanced data to produce a balanced dataset that can be used to train a precise classifier. We applied the following sampling techniques in R Studio: oversampling, under-sampling, SMOTE, and bootstrap sampling. The datasets resulting from these techniques are used to train different classifiers, and the performance of the classifiers is measured and discussed in this thesis.
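The thesis applies these sampling techniques in R Studio; as a rough Python equivalent, the sketch below runs random over-sampling, random under-sampling, SMOTE, and a bootstrap resample on a synthetic stand-in for the FAA incident data (with roughly 0.6% positives, as reported above). Everything here is illustrative, not the thesis's actual pipeline.

```python
# Compare the class counts produced by four simple re-balancing strategies.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.utils import resample
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=20000, weights=[0.994, 0.006],
                           random_state=0)

for name, sampler in [("over", RandomOverSampler(random_state=0)),
                      ("under", RandomUnderSampler(random_state=0)),
                      ("smote", SMOTE(random_state=0))]:
    Xr, yr = sampler.fit_resample(X, y)
    print(name, Counter(yr))

# Bootstrap: draw a balanced minority sample with replacement.
X_boot, y_boot = resample(X[y == 1], y[y == 1], replace=True,
                          n_samples=(y == 0).sum(), random_state=0)
print("bootstrap", Counter(y_boot))
```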
APA, Harvard, Vancouver, ISO, and other styles
7

Yao, Guan-Ting, and 姚冠廷. "A Two-Stage Hybrid Data Preprocessing Approach for the Class Imbalance Problem." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/dm48kk.

Full text
Abstract:
Master's thesis
National Central University
Department of Information Management
105
The class imbalance problem is an important issue in data mining. A skewed class distribution occurs when the number of examples representing one class is much lower than that of the other classes. Traditional classifiers tend to misclassify most samples of the minority class into the majority class because they maximize overall accuracy, which limits the construction of effective classifiers for the precious minority class. The problem occurs in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition. To deal with it, I propose a two-stage hybrid data preprocessing framework based on clustering and instance selection techniques. The approach filters out noisy data in the majority class and reduces the execution time for classifier training; more importantly, it decreases the effect of class imbalance and performs very well in the classification task. Our experiments use 44 class imbalance datasets from KEEL to build four types of classification models: C4.5, k-NN, Naïve Bayes, and MLP. A classifier ensemble algorithm is also employed, and two clustering techniques and three instance selection algorithms are compared in order to find the combination best suited to the proposed method. The experimental results show that the proposed framework performs better than many well-known state-of-the-art approaches in terms of AUC. In particular, the proposed framework combined with bagging-based MLP ensemble classifiers performs best, providing an AUC of 92%.
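A hedged sketch of a two-stage cluster-then-select preprocessing pipeline in the spirit of this framework follows. The abstract does not name its clustering and instance selection algorithms, so K-means and edited nearest neighbours stand in for them, and a bagging ensemble of MLPs mirrors the best-performing classifier reported above; all other settings are assumptions.

```python
# Stage 1: cluster the majority class and keep only samples close to their
# centroid. Stage 2: run an instance selection filter, then train a bagged MLP.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Stage 1: drop the majority samples farthest from their cluster centroid.
maj_idx = np.where(y == 0)[0]
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X[maj_idx])
dist = np.linalg.norm(X[maj_idx] - km.cluster_centers_[km.labels_], axis=1)
keep = maj_idx[dist < np.median(dist)]
idx = np.concatenate([keep, np.where(y == 1)[0]])

# Stage 2: edited nearest neighbours removes remaining noisy majority points.
X_sel, y_sel = EditedNearestNeighbours().fit_resample(X[idx], y[idx])

clf = BaggingClassifier(MLPClassifier(max_iter=500, random_state=0),
                        n_estimators=10, random_state=0)
print(cross_val_score(clf, X_sel, y_sel, scoring="roc_auc", cv=5).mean())
```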
APA, Harvard, Vancouver, ISO, and other styles
8

吳思翰. "Combine Particle Swarm Optimization and Mahalonobis-Taguchi System for Solving Classification Problem in Imbalance Data." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/06887158161687794935.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Chang, Yu-shan, and 張毓珊. "Developing Data Mining Models for Class Imbalance Problems." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/57781951199735409394.

Full text
Abstract:
Master's thesis
Chaoyang University of Technology
Master's Program, Department of Information Management
98
In classification problems, the class imbalance problem causes a bias in the training of classifiers and results in low predictive accuracy over the minority class examples. The problem is caused by imbalanced data in which almost all examples belong to one class and far fewer instances belong to the others. Compared with the majority examples, the minority examples are usually the more interesting class, such as rare diseases in medical diagnosis data, failures in inspection data, and frauds in credit screening data. When inducing knowledge from an imbalanced dataset, traditional data mining algorithms seek high classification accuracy for the majority class but produce an unacceptable error rate for the minority class, so they are not suitable for handling class imbalanced data. In order to tackle the class imbalance problem, this study aims to (1) find a robust classifier among several candidates, including Decision Tree (DT), Logistic Regression (LR), Mahalanobis Distance (MD), and Support Vector Machines (SVM); and (2) propose two novel methods called MD-SVM (a new two-phase learning scheme) and SWAI (SOM Weights As Input). Experimental results indicated that the proposed MD-SVM and SWAI have better performance in identifying minority class examples than traditional techniques such as under-sampling, cost adjusting, and cluster-based sampling.
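The abstract does not spell out the MD-SVM two-phase scheme, so the sketch below is only a loose analogue: phase 1 models the majority class and turns each sample's Mahalanobis distance to it into an extra feature, and phase 2 trains a class-weighted SVM on the augmented data. The dataset and parameters are illustrative assumptions.

```python
# Phase 1: Mahalanobis distance to the majority class as an extra feature.
# Phase 2: class-weighted SVM on the augmented representation.
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

cov = EmpiricalCovariance().fit(X_tr[y_tr == 0])       # majority-class model
X_tr_aug = np.hstack([X_tr, cov.mahalanobis(X_tr).reshape(-1, 1)])
X_te_aug = np.hstack([X_te, cov.mahalanobis(X_te).reshape(-1, 1)])

svm = SVC(class_weight="balanced", probability=True).fit(X_tr_aug, y_tr)
print(roc_auc_score(y_te, svm.predict_proba(X_te_aug)[:, 1]))
```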
APA, Harvard, Vancouver, ISO, and other styles
10

Liu, Yi-Hsun, and 劉奕勛. "Deep Discriminative Features Learning and Sampling for Imbalanced Data Problem." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/3cc7k8.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Institute of Computer Science and Engineering
106
The imbalanced data problem occurs in many application domains and is considered a challenging problem in machine learning and data mining. Oversampling may lead to overfitting, while undersampling may discard representative data samples. Additionally, most resampling methods that synthesize data focus on the minority class without considering the data distribution of the majority classes. This work presents an algorithm that combines feature embedding with loss functions from discriminative feature learning in deep learning to generate synthetic data samples. In contrast to previous work, the proposed method considers both majority and minority classes when learning feature embeddings and utilizes appropriate loss functions to make the embedding as discriminative as possible. The proposed method is a comprehensive framework, and different feature extractors can be utilized for different domains. We conduct experiments on eight numerical datasets and one image dataset based on multiclass classification tasks. The experimental results indicate that the proposed method provides accurate and stable results. Additionally, we thoroughly investigate the proposed method and utilize a visualization technique to examine why it can generate good data samples.
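The thesis learns deep discriminative embeddings before sampling; as a shallow, hedged stand-in, the sketch below learns a supervised embedding with Neighborhood Components Analysis and applies SMOTE in that embedding space, so synthetic samples are generated where the classes are already well separated. The dataset and embedding dimension are assumptions, not the thesis's setup.

```python
# Learn a discriminative embedding (NCA), over-sample with SMOTE in that
# space, then classify with k-NN and report the multiclass G-mean.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier
from imblearn.over_sampling import SMOTE
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

nca = NeighborhoodComponentsAnalysis(n_components=5, random_state=0)
Z_tr = nca.fit_transform(X_tr, y_tr)              # discriminative embedding
Z_bal, y_bal = SMOTE(random_state=0).fit_resample(Z_tr, y_tr)

clf = KNeighborsClassifier(n_neighbors=5).fit(Z_bal, y_bal)
pred = clf.predict(nca.transform(X_te))
print(geometric_mean_score(y_te, pred))
```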
APA, Harvard, Vancouver, ISO, and other styles
11

Cieslak, David A. "Finding problems in, proposing solutions to, and performing analysis on imbalanced data." 2009. http://etd.nd.edu/ETD-db/theses/available/etd-07082009-100035/.

Full text
Abstract:
Thesis (Ph. D.)--University of Notre Dame, 2009.
Thesis directed by Nitesh Chawla for the Department of Computer Science and Engineering. "July 2009." Includes bibliographical references (leaves 230-241).
APA, Harvard, Vancouver, ISO, and other styles
12

"BagStack Classification for Data Imbalance Problems with Application to Defect Detection and Labeling in Semiconductor Units." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.53957.

Full text
Abstract:
Despite the fact that machine learning supports the development of computer vision applications by shortening the development cycle, finding a general learning algorithm that solves a wide range of applications is still bounded by the "no free lunch" theorem. The search for the right algorithm to solve a specific problem is driven by the problem itself, data availability, and many other requirements. Automated visual inspection (AVI) systems represent a major part of these challenging computer vision applications. They are gaining growing interest in the manufacturing industry as a way to detect defective products and keep them from reaching customers. The process of defect detection and classification in semiconductor units is challenging due to the different acceptable variations that the manufacturing process introduces. Other variations are typically introduced when using optical inspection systems, due to changes in lighting conditions and misalignment of the imaged units, which makes defect detection more challenging. In this thesis, a BagStack classification framework is proposed, which makes use of stacking and bagging concepts to handle both variance and bias errors. The classifier is designed to handle the data imbalance and overfitting problems by adaptively transforming the multi-class classification problem into multiple binary classification problems, applying a bagging approach to train a set of base learners for each specific problem, adaptively specifying the number of base learners assigned to each problem, adaptively specifying the number of samples to use from each class, applying a novel data-imbalance-aware cross-validation technique to generate the meta-data while taking the data imbalance problem into account at the meta-data level and, finally, using a multi-response random forest regression classifier as a meta-classifier. The BagStack classifier makes use of multiple features to solve the defect classification problem. In order to detect defects, a locally adaptive statistical background modeling approach is proposed. The proposed BagStack classifier outperforms state-of-the-art image classification techniques on our dataset in terms of overall classification accuracy and average per-class classification accuracy, and the proposed detection method achieves high performance on the considered dataset in terms of recall and precision.
Doctoral Dissertation, Computer Engineering, 2019
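A greatly simplified sketch of combining bagging and stacking, as the BagStack framework does, is shown below. The adaptive per-class decomposition, imbalance-aware cross-validation, and multi-response regression meta-learner described above are not reproduced; scikit-learn's stacking ensemble with bagged base learners and a random-forest meta-classifier only illustrates the general bag-then-stack idea on assumed synthetic data.

```python
# Bagged base learners feed a random-forest meta-classifier via stacking.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data with a skewed class distribution.
X, y = make_classification(n_samples=1200, n_classes=4, n_informative=8,
                           weights=[0.55, 0.25, 0.15, 0.05], random_state=0)

base = [("bag_tree", BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                       n_estimators=20, random_state=0)),
        ("bag_svm", BaggingClassifier(SVC(random_state=0),
                                      n_estimators=10, random_state=0))]
stack = StackingClassifier(estimators=base,
                           final_estimator=RandomForestClassifier(random_state=0),
                           cv=5)
print(cross_val_score(stack, X, y, scoring="balanced_accuracy", cv=3).mean())
```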
APA, Harvard, Vancouver, ISO, and other styles
13

Esteves, Vitor Miguel Saraiva. "Techniques to deal with imbalanced data in multi-class problems: A review of existing methods." Master's thesis, 2020. https://hdl.handle.net/10216/126820.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Esteves, Vitor Miguel Saraiva. "Techniques to deal with imbalanced data in multi-class problems: A review of existing methods." Dissertação, 2020. https://hdl.handle.net/10216/126820.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

SU, PO-YU, and 蘇柏瑜. "Integrating Clustering Analysis with Granular Computing for Imbalanced Data Classification Problem─A Case Study on Prostate Cancer Prognosis." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/ps3nj5.

Full text
Abstract:
Master's thesis
National Taiwan University of Science and Technology
Department of Industrial Management
103
This study aims to deal with the class imbalance problem by using the concept of Information Granulation (IG). Majority-class data are assembled into granules to balance the class ratio within the data. This process reduces the risk of critical information being diluted by large numbers of relatively unimportant data points and noise. Three clustering techniques, dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm K-means (GA K-means), and artificial bee colony K-means (ABC K-means), are implemented to construct the information granules, and three granular computing (GrC) models are thus proposed in this study to solve the class imbalance problem. At the end of the procedure, classifiers are applied to construct classification models for each dataset. The effectiveness of the proposed GrC models has been evaluated on benchmark datasets from the UCI Machine Learning Repository. Since the proposed models produce solid classification results, real-world data on the survival length of patients with prostate cancer were then used to construct a prognosis system, and those classification results are also very promising. The results indicate that the proposed GrC models are capable of reducing the difficulty of classifying imbalanced data. Furthermore, the proposed GrC models help raise the accuracy for the minority classes and most of the overall accuracies. The computational results of prostate cancer prognosis give doctors better information and analysis of patients' survival prospects.
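A hedged sketch of the information granulation idea follows: the majority class is compressed into a small set of granules before a classifier is fitted. Plain K-means centroids stand in for the DCPSO, GA K-means, and ABC K-means variants used in the thesis, and setting the number of granules equal to the minority-class size is purely an illustrative choice.

```python
# Replace the majority class by K-means centroids ("granules") so the
# training set becomes balanced, then fit a classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

n_min = int((y_tr == 1).sum())                 # granule count = minority size
granules = KMeans(n_clusters=n_min, n_init=10,
                  random_state=0).fit(X_tr[y_tr == 0]).cluster_centers_

X_bal = np.vstack([granules, X_tr[y_tr == 1]])
y_bal = np.concatenate([np.zeros(n_min, dtype=int), np.ones(n_min, dtype=int)])

clf = SVC().fit(X_bal, y_bal)
print(geometric_mean_score(y_te, clf.predict(X_te)))
```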
APA, Harvard, Vancouver, ISO, and other styles
16

Soares, Jastin Pompeu. "Explorar diferentes estratégias de data mining aplicadas a dois problemas no pré-processamento de dados." Master's thesis, 2017. http://hdl.handle.net/10316/83131.

Full text
Abstract:
Master's dissertation (Integrated Master in Electrical and Computer Engineering) presented to the Faculty of Sciences and Technology
With increasing volumes of data, technological improvements, and the need to extract knowledge from data, Machine Learning techniques have been subjected to great study, where the main contributions are currently focused on the development and improvement of algorithms. In this context, data quality is a crucial point to achieve good results. Included in data analysis, preprocessing is one of the stages of knowledge discovery in databases that enables the improvement of data quality. This dissertation aims to contribute to two problems that may arise in the preprocessing stage: missing data and imbalanced data. To solve the first problem, researchers typically use brute-force strategies that, in addition to their high computational cost, do not take into account the nature of the data and therefore do not allow their generalization to different contexts. In this work, the relationship between the performance of state-of-the-art imputation techniques and the data distribution is explored, by trying to develop a heuristic that allows choosing the most appropriate imputation technique for each feature included in the study, to avoid the need to test several techniques. The results show that there is a relationship between the features' distributions and the imputation performance. This performance seems to be influenced by the strategy and rate of the missing data generation. In the second problem, the intention is to measure the performance of classifiers in imbalanced data contexts. The approach used to perform cross-validation (before or after pre-processing) can lead to over-optimistic performances when applying oversampling techniques to attenuate the between-class imbalance. This work aims to show the most correct approach to cross-validation and to relate the over-optimistic performance to the datasets' complexity. The results show that the most appropriate cross-validation approach is the one where the dataset splitting is performed before the pre-processing stage, and over-optimistic performances seem to be related to the similarity of the complexity of the training and test sets.
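The finding above, that the dataset must be split before over-sampling, can be illustrated directly. In the sketch below, applying SMOTE to the whole dataset before cross-validation leaks information from the test folds and tends to inflate the score, whereas placing SMOTE inside an imbalanced-learn pipeline confines it to each training fold; the dataset and classifier are arbitrary stand-ins.

```python
# Compare over-optimistic (oversample, then split) and correct
# (split, then oversample inside each fold) cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], flip_y=0.05,
                           random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# Over-optimistic: SMOTE is applied to the whole dataset before splitting.
X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)
print("oversample then split:",
      cross_val_score(clf, X_os, y_os, scoring="f1", cv=cv).mean())

# Correct: SMOTE runs inside each training fold via the pipeline.
pipe = Pipeline([("smote", SMOTE(random_state=0)), ("tree", clf)])
print("split then oversample:",
      cross_val_score(pipe, X, y, scoring="f1", cv=cv).mean())
```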
APA, Harvard, Vancouver, ISO, and other styles
17

Lin, Li, and 林立. "Integration of Particle Swarm K-means Optimization Algorithm and Granular Computing for Imbalanced Data Classification Problem - A Case Study on Prostate Cancer Prognosis." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/58965099342188347198.

Full text
Abstract:
Master's thesis
National Taiwan University of Science and Technology
Department of Industrial Management
101
In Taiwan, prostate cancer ranks fifth in cancer incidence among men and seventh in cancer mortality, and the number of men suffering from prostate cancer has gradually increased in recent years. Currently, the prognosis of prostate cancer is assessed according to the five-year survival rate, so estimating the life expectancy of prostate cancer patients has become a critical issue. However, pathological data are usually characterized by skewed distributions, which easily leads to errors in judging the pathology. In order to decrease such errors, this study focuses on the problem of class imbalance and proposes a PSKO-based granular computing (GrC) model to preprocess the skewed class distribution. The GrC model acquires knowledge from information granules rather than from raw numerical data and processes multidimensional, sparse data by using Singular Value Decomposition and Latent Semantic Indexing (LSI); data that are high-dimensional and sparse can be preprocessed with LSI to reduce the number of dimensions and records. In addition, the proposed method employed ten datasets from the UCI Machine Learning Repository to demonstrate its effectiveness. Experimental results indicate that the proposed information granulation model has promising performance in classifying both imbalanced and balanced data. The PSKO-based granular computing (GrC) model can not only obtain better information granules but also increase the ability to classify imbalanced data, and it is able to support physicians in judging the pathological condition and survival rate of prostate cancer patients.
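The SVD/LSI preprocessing step mentioned above can be sketched with a truncated SVD that projects sparse, high-dimensional data onto a few latent components before classification. The synthetic sparse matrix, component count, and logistic regression below are illustrative assumptions, not the data or classifiers of the prostate cancer study.

```python
# Reduce sparse, high-dimensional data with truncated SVD (LSI-style),
# then classify with a class-weighted model.
from scipy.sparse import random as sparse_random
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X = sparse_random(500, 1000, density=0.01, random_state=0).tocsr()
signal = X[:, :50].sum(axis=1).A.ravel()                # signal in 50 columns
y = (signal > np.percentile(signal, 85)).astype(int)    # ~15% positive class

model = make_pipeline(TruncatedSVD(n_components=20, random_state=0),
                      LogisticRegression(class_weight="balanced", max_iter=1000))
print(cross_val_score(model, X, y, scoring="roc_auc", cv=5).mean())
```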
APA, Harvard, Vancouver, ISO, and other styles
