Theses on the topic « Selective classifier »

To see other types of publications on this topic, follow the link: Selective classifier.

Create an accurate citation in APA, MLA, Chicago, Harvard, and several other styles

Choose a source:

Consult the top 50 theses for your research on the topic « Selective classifier ».

Next to each source in the list of references there is an « Add to bibliography » button. Click on it, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication as a PDF and read its abstract online whenever this information is included in the metadata.

Browse theses on a wide variety of disciplines and organize your bibliography correctly.

1

Sayin, Günel Burcu. « Towards Reliable Hybrid Human-Machine Classifiers ». Doctoral thesis, Università degli studi di Trento, 2022. http://hdl.handle.net/11572/349843.

Full text
Abstract:
In this thesis, we focus on building reliable hybrid human-machine classifiers to be deployed in cost-sensitive classification tasks. The objective is to assess ML quality in hybrid classification contexts and design the appropriate metrics, thereby knowing whether we can trust the model predictions and identifying the subset of items on which the model is well-calibrated and trustworthy. We start by discussing the key concepts, research questions, challenges, and architecture needed to design and implement an effective hybrid classification service. We then present a deeper investigation of each service component along with our solutions and results. We mainly contribute to cost-sensitive hybrid classification, selective classification, model calibration, and active learning. We highlight the importance of model calibration in hybrid classification services and propose novel approaches to improve the calibration of human-machine classifiers. In addition, we argue that the current accuracy-based metrics are misaligned with the actual value of machine learning models and propose a novel metric, "value". We further test the performance of state-of-the-art (SOTA) machine learning models on NLP tasks in a cost-sensitive hybrid classification context. We show that the performance of the SOTA models in cost-sensitive tasks drops significantly when we evaluate them according to value rather than accuracy. Finally, we investigate the quality of hybrid classifiers in active learning scenarios. We review the existing active learning strategies, evaluate their effectiveness, and propose a novel value-aware active learning strategy to improve the performance of selective classifiers in the active learning of cost-sensitive tasks.
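The "value" metric is only described informally in this abstract. As a hedged illustration of the general idea, a cost-sensitive value for a selective classifier (one that may reject low-confidence items, e.g. deferring them to a human) could be scored as below; the gains, costs, and threshold are made-up example numbers, not the thesis's actual definition:

```python
def selective_value(predictions, labels, confidences, threshold,
                    gain_correct=1.0, cost_error=5.0, cost_reject=0.5):
    """Aggregate value of a selective classifier: items below the
    confidence threshold are rejected (e.g. deferred to a human)."""
    value = 0.0
    for pred, label, conf in zip(predictions, labels, confidences):
        if conf < threshold:
            value -= cost_reject          # deferred item
        elif pred == label:
            value += gain_correct         # accepted and correct
        else:
            value -= cost_error           # accepted and wrong
    return value

# Accuracy on accepted items can look fine while value is negative
# whenever the error cost dominates the gain for correct answers.
print(selective_value([1, 0, 1, 1], [1, 0, 0, 1],
                      [0.9, 0.8, 0.95, 0.4], threshold=0.5))
```

This illustrates why a SOTA model that is slightly overconfident can score well on accuracy yet poorly on a value-style metric.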
Citation styles: APA, Harvard, Vancouver, ISO, etc.
2

BOLDT, F. A. « Classifier Ensemble Feature Selection for Automatic Fault Diagnosis ». Universidade Federal do Espírito Santo, 2017. http://repositorio.ufes.br/handle/10/9872.

Full text
Abstract:
An efficient ensemble feature selection scheme for fault diagnosis is proposed, based on three hypotheses: (a) a fault diagnosis system need not be restricted to a single feature extraction model; on the contrary, it should use as many feature models as possible, since the extracted features are potentially discriminative and the pooled feature set is subsequently reduced by feature selection; (b) the feature selection process can be accelerated, without loss of classification performance, by combining feature selection methods so that faster, weaker methods remove potentially non-discriminative features and pass a smaller, filtered feature set to slower, stronger methods; (c) the optimal feature set for a multi-class problem may differ for each pair of classes, so feature selection should be done in a one-versus-one scheme even when multi-class classifiers are used. However, since the number of classifiers grows with the number of classes, expensive techniques like Error-Correcting Output Codes (ECOC) may have a prohibitive computational cost for large datasets, so a fast one-versus-one approach must be used to alleviate this computational demand. These three hypotheses are corroborated by experiments. The main hypothesis of this work is that, by using these three approaches together, it is possible to significantly improve the classification performance of a classifier identifying conditions in industrial processes. Experiments have shown such an improvement for the 1-NN classifier in the industrial processes used as case studies.
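Hypothesis (b) above, a fast weak filter feeding a slower stronger selector, can be sketched as a two-stage cascade. The stages below (a variance filter followed by a label-correlation ranking) and all data are illustrative stand-ins, not the methods used in the thesis:

```python
import numpy as np

def cascade_select(X, y, keep_fast=10, keep_slow=4):
    """Stage 1 (fast/weak): keep the highest-variance features.
    Stage 2 (slow/strong): rank the survivors by absolute correlation
    with the label and keep the best ones."""
    variances = X.var(axis=0)
    stage1 = np.argsort(variances)[::-1][:keep_fast]   # fast-filter survivors
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in stage1]
    stage2 = stage1[np.argsort(corrs)[::-1][:keep_slow]]
    return sorted(stage2.tolist())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[:, [3, 7]] *= 2.0                       # informative features: higher variance
y = (X[:, 3] + X[:, 7] > 0).astype(int)   # labels depend on features 3 and 7
print(cascade_select(X, y))
```

The second stage only ever scores `keep_fast` features instead of all 20, which is the point of the cascade: the expensive criterion runs on a pre-filtered set.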
Citation styles: APA, Harvard, Vancouver, ISO, etc.
3

Thapa, Mandira. « Optimal Feature Selection for Spatial Histogram Classifiers ». Wright State University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=wright1513710294627304.

Full text
Citation styles: APA, Harvard, Vancouver, ISO, etc.
4

Gustafsson, Robin. « Ordering Classifier Chains using filter model feature selection techniques ». Thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-14817.

Full text
Abstract:
Context: Multi-label classification concerns classification with multi-dimensional output. The Classifier Chain breaks the multi-label problem into multiple binary classification problems, chaining the classifiers to exploit dependencies between labels. Consequently, its performance is influenced by the chain's order. Approaches to finding advantageous chain orders have been proposed, though they are typically costly. Objectives: This study explored the use of filter model feature selection techniques to order Classifier Chains. It examined how feature selection techniques can be adapted to evaluate label dependence, how such information can be used to select a chain order and how this affects the classifier's performance and execution time. Methods: An experiment was performed to evaluate the proposed approach. The two proposed algorithms, Forward-Oriented Chain Selection (FOCS) and Backward-Oriented Chain Selection (BOCS), were tested with three different feature evaluators. 10-fold cross-validation was performed on ten benchmark datasets. Performance was measured in accuracy, 0/1 subset accuracy and Hamming loss. Execution time was measured during chain selection, classifier training and testing. Results: Both proposed algorithms led to improved accuracy and 0/1 subset accuracy (Friedman & Hochberg, p < 0.05). FOCS also improved the Hamming loss while BOCS did not. Measured effect sizes ranged from 0.20 to 1.85 percentage points. Execution time was increased by less than 3 % in most cases. Conclusions: The results showed that the proposed approach can improve the Classifier Chain's performance at a low cost. The improvements appear similar to comparable techniques in magnitude but at a lower cost. It shows that feature selection techniques can be applied to chain ordering, demonstrates the viability of the approach and establishes FOCS and BOCS as alternatives worthy of further consideration.
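The core idea of ordering the chain with a filter-style dependence measure can be sketched for binary labels using empirical mutual information. This is a simplified greedy ordering in the spirit of FOCS, not the thesis's exact algorithm:

```python
import math
from itertools import product

def mutual_info(a, b):
    """Empirical mutual information between two binary label columns."""
    n = len(a)
    mi = 0.0
    for va, vb in product((0, 1), repeat=2):
        pab = sum(1 for x, y in zip(a, b) if x == va and y == vb) / n
        pa = sum(1 for x in a if x == va) / n
        pb = sum(1 for y in b if y == vb) / n
        if pab > 0:
            mi += pab * math.log(pab / (pa * pb))
    return mi

def chain_order(labels):
    """Greedy forward ordering: start from the label with the highest
    total dependence on the others, then repeatedly append the label
    most dependent on those already in the chain."""
    remaining = set(range(len(labels)))
    totals = {i: sum(mutual_info(labels[i], labels[j])
                     for j in remaining if j != i) for i in remaining}
    order = [max(remaining, key=totals.get)]
    remaining.discard(order[0])
    while remaining:
        nxt = max(remaining,
                  key=lambda i: sum(mutual_info(labels[i], labels[j])
                                    for j in order))
        order.append(nxt)
        remaining.discard(nxt)
    return order

# Label 1 copies label 0; label 2 is independent noise.
L0 = [0, 1, 0, 1, 1, 0, 1, 0]
L1 = list(L0)
L2 = [0, 0, 1, 1, 0, 1, 0, 1]
print(chain_order([L0, L1, L2]))
```

A chain trained in this order puts the strongly dependent labels early, so later binary classifiers can exploit their predictions as extra features; the independent label ends up last, where chaining cannot help it anyway.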
Citation styles: APA, Harvard, Vancouver, ISO, etc.
5

Duangsoithong, Rakkrit. « Feature selection and causal discovery for ensemble classifiers ». Thesis, University of Surrey, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.580345.

Full text
Abstract:
With the rapid development of the computer and information technology behind a large number of applications such as web text mining, intrusion detection, biomedical informatics, gene selection in microarray data, medical data mining, and clinical decision support systems, many information databases have been created. However, in some applications, especially in the medical area, clinical data may contain hundreds to thousands of features with relatively few samples. A consequence of this problem is increased complexity that leads to degradation in efficiency and accuracy. Moreover, in this high-dimensional feature space, many features are possibly irrelevant or redundant and should be removed in order to ensure good generalisation performance. Otherwise, the classifier may over-fit the data; that is, it may specialise on features which are not relevant for discrimination. To overcome this problem, feature selection and ensemble classification are applied. In this thesis, an empirical analysis of using bootstrap and random subspace feature selection for multiple classifier systems is presented, and bootstrap feature selection and embedded feature ranking for ensemble MLP classifiers, along with a stopping criterion based on the out-of-bootstrap estimate, are proposed. Moreover, feature selection does not usually take causal discovery into account. However, in some cases, such as when the testing distribution is shifted by manipulation from an external agent, causal discovery can benefit feature selection under these uncertainty conditions. It can also learn the underlying data structure, provide a better understanding of the data generation process, and give better accuracy and robustness under uncertainty. Conversely, feature selection enables global causal discovery algorithms to deal with high-dimensional data by eliminating irrelevant and redundant features before exploring the causal relationships between features.
A redundancy-based ensemble causal feature selection approach using bootstrap and random subspace and a comparison between correlation-based and causal feature selection for ensemble classifiers are analysed. Finally, hybrid correlation-causal feature selection for multiple classifier system is proposed in order to scale up causal discovery and deal with high dimensional features.
Citation styles: APA, Harvard, Vancouver, ISO, etc.
6

Ko, Albert Hung-Ren. « Static and dynamic selection of ensemble of classifiers ». Thesis, Montréal: École de technologie supérieure, 2007. http://proquest.umi.com/pqdweb?did=1467895171&sid=2&Fmt=2&clientId=46962&RQT=309&VName=PQD.

Full text
Abstract:
Thesis (Ph.D.) -- École de technologie supérieure, Montréal, 2007.
"A thesis presented to the École de technologie supérieure in partial fulfillment of the thesis requirement for the degree of Ph.D. in engineering". Bibliography: leaves [237]-246. Also available in electronic form.
Citation styles: APA, Harvard, Vancouver, ISO, etc.
7

McCrae, Richard. « The Impact of Cost on Feature Selection for Classifiers ». Thesis, Nova Southeastern University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=13423087.

Full text
Abstract:

Supervised machine learning models are increasingly being used for medical diagnosis. The diagnostic problem is formulated as a binary classification task in which trained classifiers make predictions based on a set of input features. In diagnosis, these features are typically procedures or tests with associated costs. The cost of applying a trained classifier for diagnosis may be estimated as the total cost of obtaining values for the features that serve as inputs for the classifier. Obtaining classifiers based on a low cost set of input features with acceptable classification accuracy is of interest to practitioners and researchers. What makes this problem even more challenging is that costs associated with features vary with patients and service providers and change over time.

This dissertation aims to address this problem by proposing a method for obtaining low cost classifiers that meet specified accuracy requirements under dynamically changing costs. Given a set of relevant input features and accuracy requirements, the goal is to identify all qualifying classifiers based on subsets of the feature set. Then, for any arbitrary costs associated with the features, the cost of the classifiers may be computed and candidate classifiers selected based on cost-accuracy tradeoff. Since the number of relevant input features k tends to be large for typical diagnosis problems, training and testing classifiers based on all 2^k - 1 possible non-empty subsets of features is computationally prohibitive. Under the reasonable assumption that the accuracy of a classifier is no lower than that of any classifier based on a subset of its input features, this dissertation aims to develop an efficient method to identify all qualifying classifiers.

This study used two types of classifiers (artificial neural networks and classification trees) that have proved promising for numerous problems as documented in the literature. The approach was to measure the accuracy obtained with the classifiers when all features were used. Then, reduced accuracy thresholds were arbitrarily established which could be satisfied with subsets of the complete feature set. Threshold values for three measures (true positive rate, true negative rate, and overall classification accuracy) were considered for the classifiers. Two cost functions were used for the features; one used unit costs and the other random costs. Additional manipulation of costs was also performed.

The order in which features were removed was found to have a material impact on the effort required (removing the most important features first was most efficient, removing the least important features first was least efficient). The accuracy and cost measures were combined to produce a Pareto-Optimal Frontier. There were consistently few elements on this Frontier. At most 15 subsets were on the Frontier even when there were hundreds of thousands of acceptable feature sets. Most of the computational time is taken for training and testing the models. Given costs, models in the Pareto-Optimal Frontier can be efficiently identified and the models may be presented to decision makers. Both the Neural Networks and the Decision Trees performed in a comparable fashion suggesting that any classifier could be employed.
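The Pareto-Optimal Frontier over (cost, accuracy) pairs mentioned above is straightforward to compute once candidate subsets have been scored; the subsets and scores below are invented purely for illustration:

```python
def pareto_frontier(candidates):
    """Keep the (cost, accuracy) pairs not dominated by any other:
    a candidate is dominated if another is no more expensive and
    at least as accurate, with a strict improvement somewhere."""
    frontier = []
    for c_cost, c_acc, c_name in candidates:
        dominated = any(
            (o_cost <= c_cost and o_acc >= c_acc) and
            (o_cost < c_cost or o_acc > c_acc)
            for o_cost, o_acc, _ in candidates)
        if not dominated:
            frontier.append((c_cost, c_acc, c_name))
    return sorted(frontier)

# Hypothetical feature subsets: (total feature cost, accuracy, label)
subsets = [(10, 0.91, "A"), (4, 0.88, "B"), (10, 0.88, "C"),
           (7, 0.90, "D"), (4, 0.85, "E")]
print(pareto_frontier(subsets))
```

Only the non-dominated subsets survive, which is consistent with the observation above that the frontier stays small even when the pool of acceptable feature sets is huge.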

Citation styles: APA, Harvard, Vancouver, ISO, etc.
8

McCrae, Richard Clyde. « The Impact of Cost on Feature Selection for Classifiers ». Diss., NSUWorks, 2018. https://nsuworks.nova.edu/gscis_etd/1057.

Full text
Abstract:
Supervised machine learning models are increasingly being used for medical diagnosis. The diagnostic problem is formulated as a binary classification task in which trained classifiers make predictions based on a set of input features. In diagnosis, these features are typically procedures or tests with associated costs. The cost of applying a trained classifier for diagnosis may be estimated as the total cost of obtaining values for the features that serve as inputs for the classifier. Obtaining classifiers based on a low cost set of input features with acceptable classification accuracy is of interest to practitioners and researchers. What makes this problem even more challenging is that costs associated with features vary with patients and service providers and change over time. This dissertation aims to address this problem by proposing a method for obtaining low cost classifiers that meet specified accuracy requirements under dynamically changing costs. Given a set of relevant input features and accuracy requirements, the goal is to identify all qualifying classifiers based on subsets of the feature set. Then, for any arbitrary costs associated with the features, the cost of the classifiers may be computed and candidate classifiers selected based on cost-accuracy tradeoff. Since the number of relevant input features k tends to be large for typical diagnosis problems, training and testing classifiers based on all 2^k-1 possible non-empty subsets of features is computationally prohibitive. Under the reasonable assumption that the accuracy of a classifier is no lower than that of any classifier based on a subset of its input features, this dissertation aims to develop an efficient method to identify all qualifying classifiers. This study used two types of classifiers – artificial neural networks and classification trees – that have proved promising for numerous problems as documented in the literature. 
The approach was to measure the accuracy obtained with the classifiers when all features were used. Then, reduced thresholds of accuracy were arbitrarily established which were satisfied with subsets of the complete feature set. Threshold values for three measures (true positive rates, true negative rates, and overall classification accuracy) were considered for the classifiers. Two cost functions were used for the features; one used unit costs and the other random costs. Additional manipulation of costs was also performed. The order in which features were removed was found to have a material impact on the effort required (removing the most important features first was most efficient, removing the least important features first was least efficient). The accuracy and cost measures were combined to produce a Pareto-Optimal Frontier. There were consistently few elements on this Frontier. At most 15 subsets were on the Frontier even when there were hundreds of thousands of acceptable feature sets. Most of the computational time is taken for training and testing the models. Given costs, models in the Pareto-Optimal Frontier can be efficiently identified and the models may be presented to decision makers. Both the Neural Networks and the Decision Trees performed in a comparable fashion, suggesting that any classifier could be employed.
Citation styles: APA, Harvard, Vancouver, ISO, etc.
9

Pinagé, Felipe Azevedo. « Handling Concept Drift Based on Data Similarity and Dynamic Classifier Selection ». Universidade Federal do Amazonas, 2017. http://tede.ufam.edu.br/handle/tede/5956.

Full text
Abstract:
FAPEAM - Fundação de Amparo à Pesquisa do Estado do Amazonas
In real-world applications, machine learning algorithms can be employed for spam detection, environmental monitoring, fraud detection, web click streams, among others. Most of these problems present an environment that changes over time due to the dynamic generation process of the data and/or due to streaming data. Classification of continuous data streams has become one of the major challenges of machine learning in recent decades: since the data are not known in advance, they must be learned as they become available, and predictions must be fast enough to support decisions that are often made in real time. Currently in the literature, methods based on accuracy monitoring are commonly used to detect changes explicitly. However, these methods may become infeasible in some real-world applications, mainly for two reasons: they may need human operator feedback, and they may depend on a significant decrease in accuracy before being able to detect changes. In addition, most of these methods are based on incremental learning, updating the decision model for every incoming example, which may lead the system to perform unnecessary updates. In order to overcome these problems, this thesis proposes two semi-supervised methods that detect changes explicitly by estimating and monitoring a pseudo-error; the decision model is updated only after a change is detected. In the first method, the pseudo-error is calculated using similarity measures, by monitoring the dissimilarity between past and current data distributions. The second proposed method employs dynamic classifier selection to improve the pseudo-error measurement and, as a consequence, enables online self-training of the classifier ensemble. The experiments conducted show that the proposed methods achieve competitive results, even when compared to fully supervised incremental learning methods.
This achievement, especially for the second method, is relevant because it makes change detection and reaction applicable, with high accuracy, to the many practical problems in which the true labels of instances cannot be obtained fully and immediately after classification.
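The pseudo-error idea, detecting drift from a distribution dissimilarity score rather than from labeled-error monitoring, can be sketched with a simple mean-shift statistic. This is a deliberately minimal stand-in for the thesis's similarity measures and dynamic-selection machinery, with an arbitrary threshold:

```python
import statistics

def detect_drift(reference, batches, threshold=1.0):
    """Flag a batch as drifted when the shift of its mean from the
    reference mean exceeds `threshold` reference standard deviations.
    No true labels are needed, and the reference is rebuilt only
    after a detection (not for every incoming example)."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.pstdev(reference)
    drifts = []
    for i, batch in enumerate(batches):
        score = abs(statistics.fmean(batch) - ref_mean) / ref_std
        if score > threshold:
            drifts.append(i)
            reference = batch                  # react: rebuild the reference
            ref_mean = statistics.fmean(reference)
            ref_std = statistics.pstdev(reference)
    return drifts

stable = [[0.1, -0.2, 0.0, 0.2], [0.0, 0.1, -0.1, 0.05]]
shifted = [[3.0, 3.2, 2.9, 3.1]]               # simulated concept drift
print(detect_drift([0.0, 0.1, -0.1, 0.2, -0.2], stable + shifted))
```

The two stable batches pass unflagged and trigger no model update, which mirrors the abstract's point about avoiding the unnecessary per-example updates of purely incremental learners.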
Citation styles: APA, Harvard, Vancouver, ISO, etc.
10

Ha, David. « Boundary uncertainty-based classifier evaluation ». Thesis, Doshisha University, 2019. https://doors.doshisha.ac.jp/opac/opac_link/bibid/BB13128126/?lang=0.

Full text
Abstract:
We propose a general method for the accurate evaluation of any classifier model on realistic tasks, both in a theoretical sense, despite the finiteness of the available data, and in a practical sense in terms of computation costs. The classifier evaluation challenge arises from the bias of any classification error estimate based only on finite data. We bypass this difficulty by proposing a new classifier evaluation measure called "boundary uncertainty", whose estimate based on finite data can be considered a reliable representative of its expectation based on infinite data, and we demonstrate the potential of our approach on three classifier models and thirteen datasets.
Doctor of Philosophy in Engineering
Doshisha University
Citation styles: APA, Harvard, Vancouver, ISO, etc.
11

Miranda Dos Santos, Eulanda. « Static and dynamic overproduction and selection of classifier ensembles with genetic algorithms ». Thesis, École de technologie supérieure, 2008. http://espace.etsmtl.ca/110/1/MIRANDA_DOS_SANTOS_Eulanda.pdf.

Full text
Abstract:
The "overproduction and choice" strategy is a static approach to classifier ensemble selection divided into two stages: an overproduction phase and a selection phase. This thesis focuses on the selection phase, which is the most challenging part of the overproduction-and-choice strategy. The selection phase is treated here as a single- or multi-objective optimization problem; consequently, the choice of the objective function and of the search algorithm receive particular attention in this thesis. The criteria studied include diversity measures, the error rate, and ensemble cardinality. Single-objective optimization allows an objective comparison of diversity measures with respect to overall ensemble performance, and diversity measures are combined with the error rate or with ensemble cardinality in multi-objective optimization. Experimental results are presented and discussed. We then show experimentally that overfitting is potentially present during the selection of the best classifier ensemble, and we propose a new method for detecting overfitting during the optimization (selection) process. Three strategies for controlling overfitting are then analysed. The results reveal that a global validation strategy should be used to control overfitting during the optimization of classifier ensembles. This study also shows that the global validation strategy can be used as a tool for empirically measuring the possible relationship between diversity and overall ensemble performance.
Finally, the most important contribution of this thesis is a new strategy for the dynamic selection of classifier ensembles. Traditional approaches to classifier ensemble selection are essentially static: the choice of the best ensemble is final, and that ensemble is used to classify all future examples. The dynamic overproduction-and-choice strategy proposed in this thesis selects, for each example to be classified, the subset of classifiers most confident in deciding its class. Our method combines optimization and dynamic selection in a two-level selection phase: the first level produces a population of candidate classifier ensembles with high generalization ability, while the second level dynamically selects the ensemble with the highest degree of certainty for deciding the class of the example at hand. The proposed dynamic selection method outperforms conventional (static) approaches on the pattern recognition problems studied in this thesis.
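The two-level idea, dynamically picking for each example the candidate ensemble most confident about its class, can be sketched as follows; vote margins stand in for the thesis's certainty measure, and the candidate ensembles are toy functions:

```python
def vote_margin(votes):
    """Certainty of an ensemble on one example: the margin between
    the most and second-most voted classes."""
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    top = sorted(counts.values(), reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0)

def dynamic_choice(candidate_ensembles, x):
    """Level 2: among candidate ensembles (lists of classifiers),
    use the one whose vote is most decisive for this example."""
    best = max(candidate_ensembles,
               key=lambda ens: vote_margin([clf(x) for clf in ens]))
    votes = [clf(x) for clf in best]
    return max(set(votes), key=votes.count)

# Toy classifiers on 2D points, each voting class 0 or 1.
ens_a = [lambda x: int(x[0] > 0), lambda x: int(x[1] > 0), lambda x: 1]
ens_b = [lambda x: int(x[0] + x[1] > 0)] * 3   # unanimous ensemble
print(dynamic_choice([ens_a, ens_b], (0.5, -2.0)))
```

For the example point, ens_a splits its votes while ens_b is unanimous, so the unanimous ensemble is chosen; a static strategy would have committed to a single ensemble for every future example.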
Citation styles: APA, Harvard, Vancouver, ISO, etc.
12

Chrysostomou, Kyriacos. « The role of classifiers in feature selection : number vs nature ». Thesis, Brunel University, 2008. http://bura.brunel.ac.uk/handle/2438/3038.

Full text
Abstract:
Wrapper feature selection approaches are widely used to select a small subset of relevant features from a dataset. However, Wrappers suffer from the fact that they only use a single classifier when selecting the features. The problem of using a single classifier is that each classifier is of a different nature and will have its own biases. This means that each classifier will select different feature subsets. To address this problem, this thesis aims to investigate the effects of using different classifiers for Wrapper feature selection. More specifically, it aims to investigate the effects of using different number of classifiers and classifiers of different nature. This aim is achieved by proposing a new data mining method called Wrapper-based Decision Trees (WDT). The WDT method has the ability to combine multiple classifiers from four different families, including Bayesian Network, Decision Tree, Nearest Neighbour and Support Vector Machine, to select relevant features and visualise the relationships among the selected features using decision trees. Specifically, the WDT method is applied to investigate three research questions of this thesis: (1) the effects of number of classifiers on feature selection results; (2) the effects of nature of classifiers on feature selection results; and (3) which of the two (i.e., number or nature of classifiers) has more of an effect on feature selection results. Two types of user preference datasets derived from Human-Computer Interaction (HCI) are used with WDT to assist in answering these three research questions. The results from the investigation revealed that the number of classifiers and nature of classifiers greatly affect feature selection results. In terms of number of classifiers, the results showed that few classifiers selected many relevant features whereas many classifiers selected few relevant features. In addition, it was found that using three classifiers resulted in highly accurate feature subsets. 
In terms of the nature of classifiers, it was shown that Decision Tree, Bayesian Network and Nearest Neighbour classifiers caused significant differences in both the number of features selected and the accuracy levels of the features. A comparison of results regarding the number of classifiers and the nature of classifiers revealed that the former has more of an effect on feature selection than the latter. The thesis makes contributions to three communities: data mining, feature selection, and HCI. For the data mining community, this thesis proposes a new method called WDT which integrates the use of multiple classifiers for feature selection and decision trees to effectively select and visualise the most relevant features within a dataset. For the feature selection community, the results of this thesis have shown that the number of classifiers and the nature of classifiers can truly affect the feature selection process. The results, and the suggestions based on them, can provide useful insight about classifiers when performing feature selection. For the HCI community, this thesis has shown the usefulness of feature selection for identifying a small number of highly relevant features for determining the preferences of different users.
Styles APA, Harvard, Vancouver, ISO, etc.
13

Cruz, Rafael Menelau Oliveira e. « Methods for dynamic selection and fusion of ensemble of classifiers ». Universidade Federal de Pernambuco, 2011. https://repositorio.ufpe.br/handle/123456789/2436.

Texte intégral
Résumé :
Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco
Ensembles of Classifiers (EoC) are an alternative for achieving high recognition rates in pattern recognition systems. The use of ensembles is motivated by the fact that different classifiers recognize different patterns and are therefore complementary. In this work, EoC methodologies are explored in order to improve recognition rates in different problems. First, the character recognition problem is addressed. This work proposes a new methodology that uses multiple feature extraction techniques, each based on a different approach (edges, gradients, projections). Each technique is treated as a sub-problem with its own classifier, and the outputs of these classifiers are used as input to a new classifier trained to combine (fuse) the results. Experiments show that the proposal achieved the best results in the literature for both digit and letter recognition. The second part of the dissertation deals with dynamic classifier selection (DCS). This strategy is motivated by the fact that not every classifier in the ensemble is an expert for every test pattern. Dynamic selection tries to select only the classifiers with the best performance in a region close to the input pattern in order to classify it. A study of the behavior of DCS techniques shows that they are limited by the quality of the region around the input pattern. Based on this analysis, two techniques for dynamic classifier selection are proposed. The first uses filters to reduce noise close to the test pattern. The second is a new proposal that extracts different kinds of information from the behavior of the classifiers and uses this information to decide whether a classifier should be selected or not. Experiments conducted on several pattern recognition problems show that the proposed techniques achieve a significant performance improvement.
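The dynamic-selection idea described above is commonly formalized as "overall local accuracy" (OLA): for each test point, pick the pool member that is most accurate on the point's nearest neighbors in a validation set. The data, pool construction and neighborhood size below are illustrative assumptions, not the dissertation's methods.

```python
# Minimal sketch of dynamic classifier selection by local accuracy (OLA).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# A small pool of diverse trees; bootstrap samples give them different biases.
rng = np.random.RandomState(0)
pool = []
for _ in range(10):
    idx = rng.randint(0, len(X_tr), len(X_tr))
    pool.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr[idx], y_tr[idx]))

nn = NearestNeighbors(n_neighbors=7).fit(X_val)
val_preds = np.array([c.predict(X_val) for c in pool])  # shape (pool, n_val)

def ola_predict(x):
    """Select the pool member with the best accuracy around x, then classify x."""
    neigh = nn.kneighbors([x], return_distance=False)[0]
    local_acc = (val_preds[:, neigh] == y_val[neigh]).mean(axis=1)
    return pool[int(local_acc.argmax())].predict([x])[0]

y_hat = np.array([ola_predict(x) for x in X_te])
print("OLA accuracy:", (y_hat == y_te).mean())
```

The key point is that the competence estimate is region-dependent: a classifier that is weak globally can still be chosen where it is locally strong.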
14

Almeida, Paulo Ricardo Lisboa de. « Adapting the dynamic selection of classifiers approach for concept drift scenarios ». Repositório Institucional da UFPR, 2017. http://hdl.handle.net/1884/52771.

Texte intégral
Résumé :
Advisor: Luiz Eduardo S. de Oliveira
Co-advisors: Alceu de Souza Britto Jr.; Robert Sabourin
Doctoral thesis - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defended: Curitiba, 09/11/2017
Includes references: pp. 143-154
Abstract: Many environments may suffer changes in their distributions or a posteriori probabilities over time, leading to a phenomenon known as concept drift. In these scenarios, it is crucial to implement a mechanism that adapts the classification system to the environment changes in order to minimize any accuracy loss. Under a static environment, a popular approach consists in using a Dynamic Classifier Selection (DCS)-based method to select a custom classifier/ensemble for each test instance according to its neighborhood in a validation set, where the selection can be considered region-dependent. In order to handle concept drifts, in this work the general idea of DCS is extended to be also time-dependent. Through this time-dependency, it is demonstrated that most neighborhood-based DCS methods can be adapted to handle concept drift scenarios and take advantage of the region-dependency, since classifiers trained under previous concepts may still be competent in some regions of the feature space. The time-dependency for the DCS methods is defined according to the nature of the concept drift, which determines whether the changes affect the a posteriori probabilities or only the distributions. With the necessary modifications, the Dynse framework is proposed in this work as a modular tool capable of adapting the DCS approach to concept drift scenarios. A default configuration for the Dynse framework is proposed, and an experimental protocol, containing seven well-known DCS methods and 12 concept drift problems with different properties, shows that the DCS approach can adapt to different concept drift scenarios. When compared to state-of-the-art concept drift methods, the DCS-based approach comes out ahead in terms of stability, i.e., it performs well in most cases and requires almost no parameter tuning. Keywords: Pattern Recognition. Concept Drift. Virtual Concept Drift. Real Concept Drift. Ensemble. Dynamic Classifier Selection. Local Accuracy.
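The time-dependent part of the approach above can be sketched as a streaming loop: train one classifier per incoming batch, keep a sliding validation window of recent labeled data, and let accuracy on that window drive which pool member answers. The synthetic stream, batch sizes and pool limits are illustrative assumptions, not the Dynse configuration.

```python
# Skeleton of a windowed, time-dependent selection loop over a drifting stream.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(1)

def batch(t, n=200):
    """Synthetic stream with an abrupt real drift at t >= 5 (labels flip)."""
    X = rng.randn(n, 2)
    y = (X[:, 0] > 0).astype(int)
    return X, y if t < 5 else 1 - y

pool, window = [], []           # classifier pool and sliding validation window
MAX_POOL, MAX_WINDOW = 5, 3
for t in range(10):
    X, y = batch(t)
    if window:                  # rank pool members by accuracy on the recent window
        Xw = np.vstack([w[0] for w in window])
        yw = np.hstack([w[1] for w in window])
        accs = [(c.predict(Xw) == yw).mean() for c in pool]
        best = pool[int(np.argmax(accs))]
        print(f"t={t} selected member's accuracy on new batch:",
              round((best.predict(X) == y).mean(), 2))
    pool.append(GaussianNB().fit(X, y))   # one new classifier per batch
    window.append((X, y))
    pool, window = pool[-MAX_POOL:], window[-MAX_WINDOW:]
```

After the drift, classifiers trained on the old concept score poorly on the refreshed window and stop being selected, which is the mechanism that restores accuracy.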
15

Samet, Asma. « Classifier ensemble under the belief function framework ». Thesis, Artois, 2018. http://www.theses.fr/2018ARTO0203.

Texte intégral
Résumé :
The work presented in this thesis concerns the construction of ensemble classifiers for handling uncertain data, specifically data with evidential attributes. We start by developing new machine learning classifiers within an evidential environment, and then tackle the ensemble construction process, which follows two important steps: base individual classifier selection and classifier combination. Regarding the selection step, diversity between the base individual classifiers is one of the important criteria impacting ensemble performance, and it can be achieved by training the base classifiers on diverse feature subspaces. Thus, we propose a novel framework for feature subspace extraction from data with evidential attributes. We rely mainly on rough set theory to identify all possible minimal feature subspaces, called reducts, that allow the same discrimination as the whole feature set. We then develop three methods enabling the selection of the most suitable diverse reducts for an ensemble of evidential classifiers. The proposed reduct selection methods are evaluated according to several assessment criteria, and the best one is used for selecting the best individual classifiers. Concerning the combination level, we propose to select the most appropriate combination operator among some well-known ones, including the Dempster, cautious and optimized t-norm based rules.
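Dempster's rule, the first of the combination operators named above, can be shown in a few lines: masses on intersecting focal elements are multiplied and summed, and the conflicting mass is normalized away. The frame of discernment and the example mass functions are illustrative, not taken from the thesis.

```python
# A minimal implementation of Dempster's rule of combination.
from itertools import product

def dempster(m1, m2):
    """Combine two mass functions given as dicts: frozenset -> mass."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb          # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Normalize by the non-conflicting mass.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

A, B = frozenset({"A"}), frozenset({"B"})
theta = A | B                            # the whole frame of discernment
m1 = {A: 0.6, theta: 0.4}                # evidence mostly for A
m2 = {B: 0.3, theta: 0.7}                # weak evidence for B
m = dempster(m1, m2)
print({tuple(sorted(s)): round(v, 3) for s, v in m.items()})
```

Note how the conflicting product m1(A)·m2(B) = 0.18 is removed and the remaining masses are rescaled to sum to one.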
16

Haning, Jacob M. « Feature Selection for High-Dimensional Individual and Ensemble Classifiers with Limited Data ». University of Cincinnati / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1406810947.

Texte intégral
17

Ala'raj, Maher A. « A credit scoring model based on classifiers consensus system approach ». Thesis, Brunel University, 2016. http://bura.brunel.ac.uk/handle/2438/13669.

Texte intégral
Résumé :
Managing customer credit is an important issue for every commercial bank; therefore, banks take great care when dealing with customer loans to avoid improper decisions that can lead to loss of opportunity or financial losses. The manual estimation of customer creditworthiness has become both time- and resource-consuming. Moreover, a manual approach is subjective (dependent on the bank employee who gives the estimation), which is why devising and implementing programming models that provide loan estimations is the only way of eradicating the 'human factor' in this problem. Such a model should recommend to the bank whether or not a loan should be given, or give a probability that the loan will be repaid. Many models have been designed, but there is no ideal classifier among them, since each produces some percentage of incorrect outputs; this is a critical consideration when each percent of incorrect answers can mean millions of dollars of losses for large banks. Nevertheless, logistic regression (LR) remains the industry-standard tool for credit-scoring model development. For this purpose, an investigation is carried out into combining the most efficient classifiers in the credit-scoring field, in an attempt to produce a classifier that exceeds each of its components. In this work, a fusion model referred to as the 'Classifiers Consensus Approach' is developed, which performs considerably better than each of the single classifiers that constitute it. The difference between the consensus approach and most other combiners lies in the fact that the consensus approach adopts a model of real expert-group behaviour during the process of finding the consensus (aggregate) answer. The consensus model is compared not only with single classifiers, but also with traditional combiners and a rather complex combiner known as the 'Dynamic Ensemble Selection' approach.
As pre-processing techniques, data filtering (selecting training entries that fit the input data well, and removing outliers and noisy data) and feature selection (removing useless and statistically insignificant features whose values are weakly correlated with the real quality of the loan) are used. These techniques significantly improve the consensus approach's results. The results clearly show that the consensus approach is statistically better (at the 95% confidence level, according to the Friedman test) than any other single classifier or combiner analysed; this means that, for similar datasets, there is a 95% guarantee that the consensus approach will outperform all other classifiers. The consensus approach gives not only the best accuracy, but also better AUC value, Brier score and H-measure for almost all datasets investigated in this thesis. Moreover, it outperformed logistic regression. Thus, it has been shown that the use of the consensus approach for credit scoring is justified and recommended for commercial banks. Along with the consensus approach, the dynamic ensemble selection approach is analysed; the results show that, under some conditions, it can rival the consensus approach, and its strengths include stability and high accuracy on various datasets. The consensus approach improved in this work may be considered by banks holding datasets with the same characteristics as those used here, where its use could decrease both the level of mistakenly rejected loans of solvent customers and the level of mistakenly accepted loans that will never be repaid. Furthermore, the consensus approach is a notable step towards building a universal classifier that can fit data of any structure.
Another advantage of the consensus approach is its flexibility: even if the input data changes for various reasons, the consensus approach can easily be re-trained and used with the same performance.
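A classical formalization of the expert-group behaviour described above is DeGroot-style consensus: each member repeatedly revises its probability estimate as a weighted average of the group's estimates until they agree. The trust weights and initial probabilities below are illustrative assumptions, not the thesis model.

```python
# Sketch: iterative weighted averaging of scorers' P(default) until consensus.
import numpy as np

def consensus(P, W, tol=1e-9, max_iter=1000):
    """P: (n_experts,) initial estimates; W: row-stochastic trust matrix."""
    for _ in range(max_iter):
        P_next = W @ P                     # each expert averages the group's opinions
        if np.max(np.abs(P_next - P)) < tol:
            break
        P = P_next
    return P

# Three scorers' estimated probabilities that a loan will default,
# and how much weight each places on the others' opinions.
P0 = np.array([0.80, 0.30, 0.50])
W = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
P_star = consensus(P0, W)
print("consensus P(default):", P_star.round(3))
```

Because every trust weight is positive, the iteration contracts to a single shared value lying between the most pessimistic and the most optimistic initial estimates.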
18

Hanif, Shehzad Muhammad. « Feature selection and classifier combination : Application to the extraction of textual information in scene images ». Paris 6, 2009. http://www.theses.fr/2009PA066521.

Texte intégral
Résumé :
In this thesis, we address the problem of text detection and localization in scene images. Our system consists of two parts: the text detector and the text localizer. The text detector (a cascade of boosted classifiers) uses boosting, which selects and combines relevant features and weak classifiers. More precisely, we proposed a regularized version of the AdaBoost algorithm that incorporates the computational complexity of the features and weak classifiers into the selection phase. We proposed heterogeneous features to encode textual information in images. Our classification rules belong to different classes of classifiers: discriminative, linear and non-linear, parametric and non-parametric. The detector generates candidate text regions that serve as input to the text localizer, whose goal is to find bounding rectangles around words or lines of text in the image. Results on two challenging image databases show the effectiveness of our approach.
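The cost-aware selection idea above can be sketched with a hand-rolled AdaBoost over decision stumps, where each round picks the stump minimizing weighted error plus a penalty proportional to its feature's computational cost. The data, costs and penalty weight are illustrative assumptions, not the thesis's regularized formulation.

```python
# Sketch: AdaBoost with a cost penalty in the weak-learner selection step.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(300, 3)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])           # labels in {-1, +1}
cost = np.array([1.0, 1.0, 5.0])               # feature 2 is expensive to compute
lam = 0.02                                     # regularization strength

def best_stump(X, y, w, lam):
    """Pick (feature, threshold, sign) minimizing weighted error + lam * cost."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.quantile(X[:, f], np.linspace(0.1, 0.9, 9)):
            for sign in (1, -1):
                pred = sign * np.sign(X[:, f] - thr)
                err = w[pred != y].sum()
                score = err + lam * cost[f]    # regularized selection criterion
                if best is None or score < best[0]:
                    best = (score, err, f, thr, sign)
    return best

w = np.ones(len(y)) / len(y)                   # sample weights
H = np.zeros(len(y))                           # boosted margin
for _ in range(20):
    _, err, f, thr, sign = best_stump(X, y, w, lam)
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    pred = sign * np.sign(X[:, f] - thr)
    H += alpha * pred
    w *= np.exp(-alpha * y * pred)             # re-weight misclassified samples up
    w /= w.sum()

print("training accuracy:", (np.sign(H) == y).mean())
```

With the penalty active, the expensive third feature is only chosen when its accuracy gain outweighs its cost, which is the trade-off the regularized AdaBoost formalizes.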
19

Al-Ani, Ahmed Karim. « An improved pattern classification system using optimal feature selection, classifier combination, and subspace mapping techniques ». Thesis, Queensland University of Technology, 2002.

Trouver le texte intégral
20

Lima, Tiago Pessoa Ferreira de. « An automatic method for construction of multi-classifier systems based on the combination of selection and fusion ». Universidade Federal de Pernambuco, 2013. https://repositorio.ufpe.br/handle/123456789/12457.

Texte intégral
Résumé :
In this dissertation, we present a methodology that aims at the automatic construction of multi-classifier systems based on the combination of selection and fusion. The method first finds an optimal number of clusters for the training data set and subsequently determines an ensemble for each cluster found. For model evaluation, the test data are submitted to the clustering technique, and the cluster nearest to the input emits a supervised response through its associated ensemble. Self-organizing maps were used in the clustering phase and multilayer perceptrons in the classification phase. Adaptive differential evolution was used to optimize the parameters and performance of the different techniques in the classification and clustering phases. The proposed method, called SFJADE - Selection and Fusion (SF) via Adaptive Differential Evolution (JADE), was tested on data compression of signals generated by artificial nose sensors and on well-known classification problems, including cancer, card, diabetes, glass, heart, horse, soybean and thyroid. The experimental results show that SFJADE performs better than some methods from the literature, while significantly outperforming most of the methods commonly used to construct multi-classifier systems.
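The cluster-then-classify architecture described above can be sketched compactly: partition the training data, fit one classifier per cluster, and route each test point to the classifier of its nearest cluster. KMeans and shallow decision trees below stand in for the SOM and MLP ensembles of the dissertation; the dataset is an illustrative choice.

```python
# Sketch: route each test point to the classifier of its nearest cluster.
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)
experts = {}
for c in range(3):
    mask = km.labels_ == c
    # One local expert per cluster (trees tolerate single-class clusters).
    experts[c] = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr[mask], y_tr[mask])

y_hat = [experts[c].predict(x.reshape(1, -1))[0]
         for c, x in zip(km.predict(X_te), X_te)]
acc = sum(int(a == b) for a, b in zip(y_hat, y_te)) / len(y_te)
print("routed accuracy:", round(acc, 3))
```

Each expert only ever sees its own region of the feature space, which is the "selection" half of the selection-plus-fusion combination.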
21

Ganapathy, Priya. « Development and Evaluation of a Flexible Framework for the Design of Autonomous Classifier Systems ». Wright State University / OhioLINK, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=wright1261335392.

Texte intégral
22

Faria, Fabio Augusto. « A framework for pattern classifier selection and fusion = Um arcabouço para seleção e fusão de classificadores de padrão ». [s.n.], 2014. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275503.

Texte intégral
Résumé :
Advisors: Ricardo da Silva Torres, Anderson Rocha
Doctoral thesis - Universidade Estadual de Campinas, Instituto de Computação
Abstract: The constant growth of visual data, whether from the countless monitoring video cameras available or from the popularization of mobile devices that allow each person to create, edit, and share their own images and videos, has contributed enormously to the so-called 'big-data revolution'. This sheer amount of visual data gives rise to a Pandora's box of new visual classification problems never imagined before. Image and video classification tasks have been inserted into different and complex applications, and the use of machine-learning-based solutions has become the most popular approach for several applications. Notwithstanding, there is no silver bullet that solves all problems, i.e., it is not possible to characterize all images of different domains with the same description method, nor is it possible to use the same learning method to achieve good results in every kind of application. In this thesis, we propose a framework for classifier selection and fusion. Our method seeks to combine image characterization and learning methods by means of a meta-learning approach responsible for assessing which methods contribute more towards the solution of a given problem. The framework uses three different strategies of classifier selection, which pinpoint the less correlated, yet effective, classifiers through a series of diversity-measure analyses. The experiments show that the proposed approaches yield results comparable to well-known algorithms from the literature on many different applications, while using fewer learning and description methods and not incurring the curse of dimensionality and normalization problems common to some fusion techniques. Furthermore, our approach is able to achieve effective classification results using very reduced training sets.
Doctorate
Computer Science
Doctor in Computer Science
23

Zoghi, Zeinab. « Ensemble Classifier Design and Performance Evaluation for Intrusion Detection Using UNSW-NB15 Dataset ». University of Toledo / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1596756673292254.

Texte intégral
24

Zhang, Qing, et Ralph Frankowski. « An empirical evaluation of the random forests classifier models for variable selection in a large-scale lung cancer case-control study ». 2006. http://proquest.umi.com/pqdweb?did=1324365481&sid=1&Fmt=2&clientId=68716&RQT=309&VName=PQD.

Texte intégral
25

He, Jeannie. « Automatic Diagnosis of Parkinson’s Disease Using Machine Learning : A Comparative Study of Different Feature Selection Algorithms, Classifiers and Sampling Methods ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301616.

Texte intégral
Résumé :
Over the past few years, several studies have been published proposing algorithms for the automated diagnosis of Parkinson's disease using simple exams such as drawing and voice exams. However, while every classifier appears to have been outperformed by some other classifier in at least one study, there appears to be no study of how well different classifiers work with a given feature selection algorithm and sampling method. More importantly, there appears to be no study that compares the proposed feature selection algorithm and/or sampling method with a baseline that involves no feature selection or oversampling. This leaves open the questions of which combination of feature selection algorithm, sampling method and classifier is best, and what impact feature selection and oversampling have on performance. Given the importance of providing a quick and accurate diagnosis of Parkinson's disease, a comparison is made between different systems of classifier, feature selection algorithm and sampling method, with a focus on predictive performance. One system was chosen as the best for the diagnosis of Parkinson's disease based on its comparative predictive performance on two sets of data: one from drawing exams and one from voice exams.
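The comparison the thesis describes can be condensed into a grid over systems: every combination of feature-selection setting and classifier is scored by cross-validation, and the best-scoring system is kept. The dataset and candidate components below are illustrative stand-ins for the drawing/voice exam data and the thesis's actual candidates.

```python
# Sketch: pick the best (feature selection, classifier) system by cross-validation.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
selectors = {"none": "passthrough", "k=5": SelectKBest(f_classif, k=5)}
classifiers = {"svm": SVC(), "nb": GaussianNB()}

results = {}
for s_name, sel in selectors.items():
    for c_name, clf in classifiers.items():
        pipe = Pipeline([("scale", StandardScaler()), ("select", sel), ("clf", clf)])
        results[(s_name, c_name)] = cross_val_score(pipe, X, y, cv=5).mean()

best = max(results, key=results.get)
print("best system:", best, "accuracy:", round(results[best], 3))
```

Including the "none" selector gives exactly the baseline the thesis argues is missing: the same classifier with no feature selection at all.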
26

Marin Rodenas, Alfonso. « Comparison of Automatic Classifiers’ Performances using Word-based Feature Extraction Techniques in an E-government setting ». Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-32363.

Texte intégral
Résumé :
Nowadays, email is commonly used by citizens to establish communication with their government. Among the received emails, governments deal with common queries and subjects that handling officers have to answer manually. Automatic classification of incoming emails increases communication efficiency by decreasing the delay between a query and its response. This thesis is part of the IMAIL project, which aims to provide an automatic answering solution for the Swedish Social Insurance Agency (SSIA) (“Försäkringskassan” in Swedish). The goal of this thesis is to analyze and compare the classification performance of different sets of features extracted from SSIA emails on different automatic classifiers. The features extracted from the emails also depend on the preprocessing carried out: compound splitting, lemmatization, stop-word removal, part-of-speech tagging and N-grams are the processes applied to the data set. Classification is performed using Support Vector Machines (SVM), k-Nearest Neighbors and Naive Bayes, and precision, recall and F-measure are used to analyze and compare the results. From the results obtained in this thesis, SVM provides the best classification, with an F-measure of 0.787. However, Naive Bayes provides a better classification than SVM for most of the email categories, so it cannot be concluded that SVM classifies better than Naive Bayes overall. Furthermore, a comparison is made with Dalianis et al. (2011), whose results this approach outperformed: SVM achieved an F-measure of 0.858 when using PoS-tagging on original emails, improving by almost 3% on the 0.83 obtained by Dalianis et al. (2011). In this case, SVM was clearly better than Naive Bayes.
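A tiny end-to-end version of the comparison above is easy to write down: bag-of-words features, then SVM versus Naive Bayes scored by macro F-measure. The toy "emails" and categories are invented placeholders for the SSIA data.

```python
# Sketch: SVM vs. Naive Bayes on bag-of-words features, scored by macro F1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

train_texts = ["how do I claim parental benefit", "parental benefit application form",
               "my housing allowance was rejected", "question about housing allowance",
               "sickness benefit payment is late", "when is sickness benefit paid"]
train_y = [0, 0, 1, 1, 2, 2]            # 0: parental, 1: housing, 2: sickness
test_texts = ["claim form for parental benefit", "housing allowance question",
              "late payment of sickness benefit"]
test_y = [0, 1, 2]

vec = CountVectorizer()
Xtr, Xte = vec.fit_transform(train_texts), vec.transform(test_texts)

f1s = {}
for clf in (LinearSVC(), MultinomialNB()):
    pred = clf.fit(Xtr, train_y).predict(Xte)
    f1s[type(clf).__name__] = f1_score(test_y, pred, average="macro")
print(f1s)
```

Swapping `CountVectorizer` for a preprocessing pipeline (lemmatization, stop-word removal, N-grams) is where the thesis's feature-set comparison would plug in.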
27

Milne, Linda. « Machine learning for automatic classification of remotely sensed data ». University of New South Wales, Computer Science & Engineering, 2008. http://handle.unsw.edu.au/1959.4/41322.

Texte intégral
Résumé :
As more and more remotely sensed data becomes available, it is becoming increasingly hard to analyse with traditional labour-intensive, manual methods. The commonly used techniques, which involve expert evaluation, are widely acknowledged as providing inconsistent results at best. We need more general techniques that can adapt to a given situation and that incorporate the strengths of the traditional methods, human operators and new technologies. The difficulty in interpreting remotely sensed data is that often only a small amount of data is available for classification, and it can be noisy, incomplete or contain irrelevant information. Given that the training data may be limited, we demonstrate a variety of techniques for highlighting information in the available data and for selecting the most relevant information for a given classification task. We show that more consistent results between the training data and an entire image can be obtained, and how misclassification errors can be reduced. Specifically, a new technique for attribute selection in neural networks is demonstrated. Machine learning techniques, in particular, provide us with a means of automating classification using training data from a variety of sources, including remotely sensed data and expert knowledge. A classification framework is presented in this thesis that can be used with any classifier and any available data. While it was developed in the context of vegetation mapping from remotely sensed data using machine learning classifiers, it is a general technique that can be applied to any domain, with particular emphasis on domains where adequate training data is unavailable.
Styles APA, Harvard, Vancouver, ISO, etc.
28

Watts-Willis, Tristan A. « Autonomous model selection for surface classification via unmanned aerial vehicle ». Scholarly Commons, 2017. https://scholarlycommons.pacific.edu/uop_etds/224.

Texte intégral
Résumé :
In the pursuit of research in remote areas, robots may be employed to deploy sensor networks. These robots need a method of classifying a surface to determine if it is a suitable installation site. Developing surface classification models manually requires significant time and detracts from the goal of automating systems. We create a system that automatically collects the data using an Unmanned Aerial Vehicle (UAV), extracts features, trains a large number of classifiers, selects the best classifier, and programs the UAV with that classifier. We design this system with user configurable parameters for choosing a high accuracy, efficient classifier. In support of this system, we also develop an algorithm for evaluating the effectiveness of individual features as indicators of the variable of interest. Motivating our work is a prior project that manually developed a surface classifier using an accelerometer; we replicate those results with our new automated system and improve on those results, providing a four-surface classifier with a 75% classification rate and a hard/soft classifier with a 100% classification rate. We further verify our system through a field experiment that collects and classifies new data, proving its end-to-end functionality. The general form of our system provides a valuable tool for automation of classifier creation and is released as an open-source tool.
Styles APA, Harvard, Vancouver, ISO, etc.
29

Wong, Kwok Wai Johnny. « Development of selection evaluation and system intelligence analytic models for the intelligent building control systems ». Thesis, The Hong Kong Polytechnic University, 2007. https://eprints.qut.edu.au/20343/1/c20343.pdf.

Texte intégral
Résumé :
With the availability of innumerable ‘intelligent’ building products and the dearth of inclusive evaluation tools, design teams are confronted with the quandary of choosing the apposite building control systems to suit the needs of a particular intelligent building project. The paucity of measures that represent the degree of system intelligence and indicate the desirable goal in intelligent building control systems design further inhibits the consumers from comparing numerous products from the viewpoint of intelligence. This thesis develops models for facilitating the selection evaluation and the system intelligence analysis for the seven predominant building control systems in the intelligent building. To achieve these objectives, systematic research activities are conducted to first develop, test and refine the general conceptual models using consecutive surveys; then, to convert the developed conceptual frameworks to the practical models; and, finally, to evaluate the effectiveness of the practical models by means of expert validations.

The findings of this study, on one hand, suggest that there are different sets of critical selection criteria (CSC) affecting the selection decision of the intelligent building control systems. Service life and operating and maintenance costs are perceived as two common CSC. The survey results generally reflect that an ‘intelligent’ building control system does not necessarily need to be technologically advanced. Instead, it should be one that can ensure efficiency and enhance user comfort and cost effectiveness. On the other hand, the findings of the research on system intelligence suggest that each building control system has a distinctive set of intelligence attributes and indicators.
The research findings also indicate that operational benefits of the intelligent building exert a considerable degree of influence on the relative importance of intelligence indicators of the building control systems in the models. This research not only presents a systematic and structured approach to evaluate candidate building control systems against the CSC, but it also suggests a benchmark to measure the degree of intelligence of one control system candidate against another.
Styles APA, Harvard, Vancouver, ISO, etc.
30

Duan, Cheng. « Imbalanced Data Classification with the K-Closest Resemblance Classifier for Remote Sensing and Social Media Texts ». Thesis, Université d'Ottawa / University of Ottawa, 2020. http://hdl.handle.net/10393/41424.

Texte intégral
Résumé :
Data imbalance has been a challenge in many areas of automatic classification. Many popular approaches, including over-sampling, under-sampling, and the Synthetic Minority Oversampling Technique (SMOTE), have been developed and tested in previous research. A big problem with these techniques is that they try to solve the problem by modifying the original data rather than truly overcoming the imbalance and letting the classifiers learn. The imbalanced data challenge also exists for tasks in areas like remote sensing and depression detection. Researchers have made efforts to overcome the challenge by adopting methods at the data pre-processing step. However, in remote sensing and depression detection tasks, the main interest is still in applying different new classifiers, such as deep learning, which have powerful classification ability but still do not treat data imbalance as a prime factor in lower classification performance. In this thesis, we demonstrate the performance of K-CR in evaluation experiments on an urban land cover classification dataset and on two depression detection datasets. The latter two datasets consist of social media texts (tweets), so we propose to adopt a feature selection technique, Term Frequency - Category-Based Term Weights (TF-CBTW), and various word embedding techniques (Word2Vec, FastText, GloVe, and the language model BERT). This feature selection method has not been applied before in similar settings, and we show that it helps to improve the efficiency and the results of the K-CR classifier. Our three experiments show that K-CR can achieve comparable performance on the majority classes and better performance on minority classes when compared to classifiers such as Random Forest, K-Nearest Neighbour, Support Vector Machines, Multi-layer Perceptron, Convolutional Neural Networks, and Long Short-Term Memory.
Styles APA, Harvard, Vancouver, ISO, etc.
31

Chang, Liang-Hao, et 張良豪. « Improving the performance of Naive Bayes Classifier by using Selective Naive Bayesian Algorithm and Prior Distributions ». Thesis, 2009. http://ndltd.ncl.edu.tw/handle/92613736217287175606.

Texte intégral
Résumé :
Master's thesis
National Cheng Kung University
Department of Industrial and Information Management
97
Naive Bayes classifiers have been widely used for data classification because of their computational efficiency and competitive accuracy. When all attributes are employed for classification, the accuracy of the naive Bayes classifier is generally affected by noisy attributes, so a mechanism for attribute selection should be considered to improve its prediction accuracy. The selective naive Bayesian method is a very successful approach for removing noisy and/or redundant attributes. In addition, attributes are generally assumed to have prior distributions, such as Dirichlet or generalized Dirichlet distributions, for achieving a higher prediction accuracy. Many studies have proposed methods for finding the best priors for attributes, but none of them takes attribute selection into account. This thesis therefore proposes two models that combine prior distributions and feature selection to increase the accuracy of the naive Bayes classifier. Model I finds the best prior for each attribute after all attributes have been determined by the selective naive Bayesian algorithm. Model II finds the best prior of the newest attribute determined by the selective naive Bayesian algorithm once all predecessors of the newest attribute have their best priors. Experimental results on 17 data sets from the UCI data repository show that Model I with the general Dirichlet prior generally and consistently achieves a higher classification accuracy.
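The selective naive Bayesian algorithm discussed above is, at its core, a greedy forward search over attributes. A minimal sketch under simplifying assumptions (categorical toy data, Laplace smoothing in place of the Dirichlet priors studied in the thesis, and scoring by training accuracy):

```python
from collections import Counter

def nb_accuracy(rows, labels, features):
    """Training accuracy of a categorical naive Bayes classifier
    restricted to the given feature indices (Laplace smoothing)."""
    classes = Counter(labels)
    # per-class, per-feature value counts
    counts = {c: [Counter() for _ in features] for c in classes}
    for row, y in zip(rows, labels):
        for j, f in enumerate(features):
            counts[y][j][row[f]] += 1
    correct = 0
    for row, y in zip(rows, labels):
        best, best_score = None, float("-inf")
        for c in classes:
            score = classes[c]
            for j, f in enumerate(features):
                vals = counts[c][j]
                score *= (vals[row[f]] + 1) / (classes[c] + len(vals) + 1)
            if score > best_score:
                best, best_score = c, score
        correct += best == y
    return correct / len(rows)

def selective_nb(rows, labels):
    """Greedy forward selection: repeatedly add the attribute that most
    improves accuracy, stopping when no attribute helps."""
    remaining = list(range(len(rows[0])))
    chosen, best_acc = [], 0.0
    while remaining:
        acc, f = max((nb_accuracy(rows, labels, chosen + [f]), f)
                     for f in remaining)
        if acc <= best_acc:
            break
        chosen.append(f)
        remaining.remove(f)
        best_acc = acc
    return chosen, best_acc
```

On a toy data set where attribute 0 determines the class and attribute 1 is noise, the search keeps only attribute 0.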
Styles APA, Harvard, Vancouver, ISO, etc.
32

Lin, Bonnie Ching-yi. « A Production Experiment of Mandarin Classifier Selection ». 2001. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0021-2603200719114250.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
33

Lin, Bonnie Ching-yi, et 林靜怡. « A Production Experiment of Mandarin Classifier Selection ». Thesis, 2002. http://ndltd.ncl.edu.tw/handle/35300853562939351707.

Texte intégral
Résumé :
Master's thesis
National Taiwan Normal University
Graduate Institute of English
90
The Chinese classifier system has always been an intriguing topic of discussion. In this study, we focus on the classifier selection of Mandarin Chinese speakers. We discuss three potential factors underlying Mandarin classifier selection: the semantic relation between classifiers and the following nouns, the syntactic environment in which classifiers occur, and the physical traits of the target objects. In the first part of the study, we specifically examine classifiers after numerals. The results indicate that when the semantic content of a particular classifier is close to the following noun, this classifier is more likely to be preserved. As the semantic relation between a noun and a classifier becomes more distant, the classifier tends to be either neutralized to the general classifier ge or substituted with another specific classifier that shares some semantic feature with the original classifier. The second part of this thesis deals with classifier selection after demonstratives. We compare the neutralization of classifiers after numerals (using the data obtained in the first part of the study) and after demonstratives. The results show that classifiers occurring after demonstratives are neutralized more often than those after numerals, and the difference reaches statistical significance. The last part of this research investigates the conceptual mechanism underlying classifier selection. By changing the physical traits of the same target objects, we expected subjects to react differently and choose different classifiers according to the most salient perceptual feature of the two pictures (of the same target object). However, the result is not as expected: subjects seem not to be influenced by changes in shape, size, and so on; they nevertheless tend to choose the classifier in their lexicon that collocates with a particular noun most frequently. That is, collocation frequency seems to play a bigger role than conception in classifier selection.
Styles APA, Harvard, Vancouver, ISO, etc.
34

Li, Jia-ling, et 李佳玲. « ACO-based Feature Selection and Classifier Parameter Optimization ». Thesis, 2008. http://ndltd.ncl.edu.tw/handle/943bta.

Texte intégral
Résumé :
Master's thesis
National Kaohsiung First University of Science and Technology
Graduate Institute of Information Management
96
96
Support Vector Machines (SVM) are one of the newer techniques for pattern classification. The kernel parameter settings for SVM in the training process affect the classification accuracy, and a proper feature subset can also improve classification efficiency and accuracy. This study hybridizes the SVM with ant colony optimization (ACO) to simultaneously optimize the kernel parameters and the feature subset without degrading the classification accuracy. Feature importance and pheromone information are used to determine the transition probability, while both the classification accuracy and the feature weight vector provided by the SVM classifier are used to update the pheromone information. Experimental results on five datasets show that the proposed approach can successfully reduce data dimensions and maintain the classification accuracy.
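The ACO transition rule and pheromone update described here can be sketched as follows (the exponents alpha and beta and the evaporation rate rho are illustrative assumptions, not values from the thesis):

```python
def transition_probs(pheromone, importance, alpha=1.0, beta=1.0):
    """P(select feature i) is proportional to
    pheromone_i**alpha * importance_i**beta."""
    weights = [(p ** alpha) * (h ** beta)
               for p, h in zip(pheromone, importance)]
    total = sum(weights)
    return [w / total for w in weights]

def update_pheromone(pheromone, selected, accuracy, svm_weights, rho=0.1):
    """Evaporate all trails, then deposit pheromone on the selected
    features in proportion to the classifier's accuracy and the
    magnitude of its per-feature weights."""
    new = [(1 - rho) * p for p in pheromone]
    for i in selected:
        new[i] += accuracy * abs(svm_weights[i])
    return new
```

An ant would sample its next feature from `transition_probs`, and after evaluating the resulting SVM, the colony would call `update_pheromone` with the achieved accuracy and weight vector.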
Styles APA, Harvard, Vancouver, ISO, etc.
35

Wu, Pei-Tzu, et 吳珮慈. « Cancer Classifier – Feature Selection and Gene Feature Combinations ». Thesis, 2010. http://ndltd.ncl.edu.tw/handle/77421859242670457770.

Texte intégral
Résumé :
Master's thesis
National Chiao Tung University
Institute of Computer Science and Engineering
98
98
Breast cancer is a leading cause of death for women. Many researchers are dedicated to the investigation of cancer classification, attempting to find malignant tumors and direct therapies in early stages. We therefore used feature selection methods and ensemble classifier models to identify and predict breast cancer classes. The diagnostic data of breast cancer provide informative and significant knowledge for cancer classification, so we apply feature selection techniques to retrieve and rank the importance of attributes, and then classify with varying combinations of the selected attributes. The study used breast cancer datasets and classified with individual K-nearest neighbor, Quadratic Classifier and Support Vector Machine classifiers, as well as ensemble models and a combined model. The goal is to construct an efficient classification model that improves accuracy and identifies the most significant features of malignant breast cancer.
Styles APA, Harvard, Vancouver, ISO, etc.
36

Yu, Cheng-Lin, et 余晟麟. « A Generalized Image Classifier based on Feature Selection ». Thesis, 2015. http://ndltd.ncl.edu.tw/handle/64867408414754377176.

Texte intégral
Résumé :
Master's thesis
National Taiwan Normal University
Department of Computer Science and Information Engineering
103
103
Establishing an image classification system traditionally requires a series of complex procedures, including collecting training samples, feature extraction, model training and accuracy analysis. Generally, an established image classification system is only used to identify images of a specific topic, because the system can exploit knowledge of that specific image domain to train a model, which leads to higher accuracy. Most image classification methods in earlier studies focus on specific domains; in contrast, the method proposed in the current research does not specify the image domain in advance, while an image classification system can still be established. In actual applications, it is not easy to collect training images, and the provided training samples are therefore often insufficient. We have built an image classifier with a small number of training samples and extracted numerous features of every variety, so that the classifier is equipped to represent images of different topics. To create a general classifier that can function without the need to identify a certain image domain, an SVM classifier and the F-score feature selection method are combined, and within the field of image classification, specific features are selected to facilitate the classification tasks.
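The F-score criterion paired with an SVM, as in this abstract, ranks each feature by between-class versus within-class scatter. A small illustrative sketch in plain Python (binary labels and the toy data are assumptions; a real pipeline would typically use a library such as scikit-learn):

```python
def f_score(xs, ys):
    """F-score of one feature with binary labels (+1 / not +1):
    between-class scatter over within-class scatter."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y != 1]
    m = sum(xs) / len(xs)
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    num = (mp - m) ** 2 + (mn - m) ** 2
    den = (sum((x - mp) ** 2 for x in pos) / (len(pos) - 1)
           + sum((x - mn) ** 2 for x in neg) / (len(neg) - 1))
    return num / den

def rank_features(rows, ys):
    """Return feature indices sorted by decreasing F-score."""
    d = len(rows[0])
    scores = [f_score([r[j] for r in rows], ys) for j in range(d)]
    return sorted(range(d), key=lambda j: -scores[j])
```

Features with the highest F-scores are kept as SVM inputs; a discriminative feature scores far above a feature whose class means coincide.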
Styles APA, Harvard, Vancouver, ISO, etc.
37

Skalak, David Bingham. « Prototype selection for composite nearest neighbor classifiers ». 1997. https://scholarworks.umass.edu/dissertations/AAI9737585.

Texte intégral
Résumé :
Combining the predictions of a set of classifiers has been shown to be an effective way to create composite classifiers that are more accurate than any of the component classifiers. Increased accuracy has been shown in a variety of real-world applications, ranging from protein sequence identification to determining the fat content of ground meat. Despite such individual successes, the answers are not known to fundamental questions about classifier combination, such as "Can classifiers from any given model class be combined to create a composite classifier with higher accuracy?" or "Is it possible to increase the accuracy of a given classifier by combining its predictions with those of only a small number of other classifiers?". The goal of this dissertation is to provide answers to these and closely related questions with respect to a particular model class, the class of nearest neighbor classifiers. We undertake the first study that investigates in depth the combination of nearest neighbor classifiers. Although previous research has questioned the utility of combining nearest neighbor classifiers, we introduce algorithms that combine a small number of component nearest neighbor classifiers, where each of the components stores a small number of prototypical instances. In a variety of domains, we show that these algorithms yield composite classifiers that are more accurate than a nearest neighbor classifier that stores all training instances as prototypes. The research presented in this dissertation also extends previous work on prototype selection for an independent nearest neighbor classifier. We show that in many domains, storing a very small number of prototypes can provide classification accuracy greater than or equal to that of a nearest neighbor classifier that stores all training instances. We extend previous work by demonstrating that algorithms that rely primarily on random sampling can effectively choose a small number of prototypes.
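The random-sampling approach to prototype selection summarized above can be sketched in a few lines (the 1-NN base classifier, squared Euclidean distance, and the trial count are illustrative assumptions):

```python
import random

def nn_accuracy(prototypes, proto_labels, rows, labels):
    """Accuracy of a 1-nearest-neighbour classifier that stores only
    the given prototypes (squared Euclidean distance)."""
    correct = 0
    for row, y in zip(rows, labels):
        dists = [sum((a - b) ** 2 for a, b in zip(row, p))
                 for p in prototypes]
        correct += proto_labels[dists.index(min(dists))] == y
    return correct / len(rows)

def sample_prototypes(rows, labels, k, trials=50, seed=0):
    """Draw random size-k subsets of the training set and keep the one
    whose 1-NN accuracy on the full training set is highest."""
    rng = random.Random(seed)
    best, best_acc = None, -1.0
    for _ in range(trials):
        idx = rng.sample(range(len(rows)), k)
        acc = nn_accuracy([rows[i] for i in idx],
                          [labels[i] for i in idx], rows, labels)
        if acc > best_acc:
            best, best_acc = idx, acc
    return best, best_acc
```

On two well-separated clusters, storing just one prototype per class already reproduces the full classifier's decisions.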
Styles APA, Harvard, Vancouver, ISO, etc.
38

Hwang, Cheng-wei, et 黃政偉. « A Neural Network Document Classifier with Linguistic Feature Selection ». Thesis, 1999. http://ndltd.ncl.edu.tw/handle/11955938873619510783.

Texte intégral
Résumé :
Master's thesis
National Taiwan University of Science and Technology
Department of Electronic Engineering
87
In this paper, a neural network document classifier with linguistic feature selection and multi-category output is presented. The proposed classifier is capable of classifying documents that are unstructured and contain linguistic description. It consists of a feature selection unit and a hierarchical neural network classification unit. In the feature selection unit, we extract terms from the original documents by text processing, then analyze the conformity and uniformity of each term with an entropy function designed to measure significance. Terms with high significance are selected as input features for the subsequent classifiers. To reduce the input dimension, we apply a mechanism to merge synonyms: according to the uniformity analysis, we obtain a term similarity matrix by a fuzzy relation operation and then construct a synonym thesaurus, so that synonyms can be grouped. In the hierarchical neural network classification unit, we adopt the well-known back-propagation model to build the hierarchical classification unit. In our experiment, a product description database from an e-commerce company is employed. The classification results achieve sufficient accuracy to aid manual classification effectively; therefore, much manpower and working time can be saved.
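Entropy-based term significance of the kind described here can be illustrated as follows. Treating low entropy across categories as high significance is one common reading of such conformity/uniformity analyses; the exact significance function used in the thesis may differ:

```python
import math

def term_entropy(counts):
    """Shannon entropy of a term's occurrence distribution over
    categories. Terms concentrated in few categories (low entropy)
    are more discriminative than terms spread evenly."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def select_terms(term_counts, threshold):
    """Keep terms whose category distribution has entropy below
    the threshold (i.e. high-significance terms)."""
    return [t for t, counts in term_counts.items()
            if term_entropy(counts) < threshold]
```

A topic-specific term such as a product name scores near 0 bits, while a function word spread evenly over categories scores near log2 of the category count and is discarded.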
Styles APA, Harvard, Vancouver, ISO, etc.
39

Tan, Chia-Chen, et 談家珍. « An Intelligent Web-Page Classifier with Fair Feature-Subnet Selection ». Thesis, 2000. http://ndltd.ncl.edu.tw/handle/80632212675707369174.

Texte intégral
Résumé :
Master's thesis
National Taiwan University of Science and Technology
Department of Electronic Engineering
88
88
The explosion of on-line information has given rise to many manually constructed topic hierarchies (such as Yahoo!). But with the current growth rate in the amount of information, manual classification in topic hierarchies creates an immense information bottleneck, so developing an automatic classifier is an urgent need. However, classifiers suffer from enormous dimensionality, since the dimensionality is determined by the number of distinct keywords in a document corpus. More seriously, most classifiers either work slowly or are constructed subjectively without learning ability. In this thesis, we address these problems with a novel evaluation function and an adaptive fuzzy learning network (AFLN). First, to reduce the enormous dimensionality, we employ a novel evaluation function in the feature subset selection algorithm. Further, we develop the AFLN for classifying new documents into existing manually generated hierarchies. In contrast to other approaches, the evaluation function is theoretically sound, gives equal treatment to each category, and has the ability to identify both positive and negative features. On the other hand, the AFLN provides extremely fast training and testing times and, more importantly, has the ability to learn human knowledge. In short, our methods will allow large amounts of information to be organized and presented to users in a comprehensible way. By alleviating the information bottleneck, we hope to help users with the problems of information access on the Web.
Styles APA, Harvard, Vancouver, ISO, etc.
40

Wang, Ding-En, et 王鼎恩. « Features Selection and GMM Classifier for Multi-Pose Face Recognition ». Thesis, 2015. http://ndltd.ncl.edu.tw/handle/03408850317662581389.

Texte intégral
Résumé :
Master's thesis
National Dong Hwa University
Department of Computer Science and Information Engineering
103
Face recognition is widely used in security applications, such as homeland security, video surveillance, law enforcement, and identity management. However, there are still some problems in face recognition systems; the main ones include lighting changes, facial expression changes, pose variations and partial occlusion. Although many face recognition approaches have reported satisfactory performance, their successes are limited to controlled environments. In fact, pose variation has been identified as one of the most challenging problems in the real world, and many algorithms focusing on how to handle pose variation have therefore received much attention. To address the pose variation problem, in this thesis we propose a multi-pose face recognition system based on an effective classifier design using SURF features. In the training phase, the proposed method uses SURF features to calculate the similarity between two images of the same face in different poses, and a face recognition model (GMM) is trained using the robust SURF features from different poses. In the testing phase, feature vectors corresponding to the test images are input to all trained models to decide on the recognized face. Experimental results show that the performance of the proposed method is better than that of other existing methods.
Styles APA, Harvard, Vancouver, ISO, etc.
41

Lin, Ching-chiang, et 林靖強. « Selection of Relevant Features for Multi-Relational Naive Bayesian Classifier ». Thesis, 2010. http://ndltd.ncl.edu.tw/handle/30511079503769361755.

Texte intégral
Résumé :
Master's thesis
National Chung Cheng University
Graduate Institutes of Information Management and Healthcare Information Management
98
Most structured data is stored in relational databases, distributed over multiple relations according to its characteristics. To mine such data, we often join several relations into a single relation through foreign key links, a process often called "flattening". Unfortunately, flattening may cause problems such as high time consumption and statistical skew in the data. Hence, how to mine data directly over numerous relations has become a pressing issue. Multi-relational data mining (MRDM) has been successfully applied in a variety of areas, such as marketing, sales, finance, fraud detection, and the natural sciences. Many ILP-based methods have been proposed in previous research, but other problems, such as scalability, remain unresolved. Irrelevant or redundant attributes in a relation may not contribute to classification accuracy. Thus, feature selection is an essential data processing step in multi-relational data mining: by filtering out irrelevant or redundant features from relations, we improve classification accuracy, achieve good time performance, and improve the comprehensibility of the models. We propose a hybrid feature selection approach called Hybrid-BC, which trains a multi-relational naïve Bayesian classifier to classify or label unknown data. We set different cutoff values to filter features for Hybrid-BC in order to observe the impact on classification accuracy. The experimental results show that effectively choosing a small set of relevant features enhances classification accuracy.
Styles APA, Harvard, Vancouver, ISO, etc.
42

CHEN, YU-XUN, et 陳俞勳. « A Study on Feature Selection Methods for the Nonspecific Classifier ». Thesis, 2018. http://ndltd.ncl.edu.tw/handle/7kg6m4.

Texte intégral
Résumé :
Master's thesis
National Yunlin University of Science and Technology
Department of Information Management
106
With the rapid development of network and information technology, a large amount of data has been generated. In order to process large amounts of data effectively, reduce the number of features, and retain classification accuracy, feature selection is required. Feature selection is a process that effectively selects the more important features; it also reduces the dimensionality and makes learning algorithms run faster and more smoothly. Feature selection methods are divided into three kinds: filter methods, wrapper methods, and embedded methods. Filter methods are independent of the classifier, and this kind of feature selection method is used in this study. The study uses several different feature selection methods to select the more important features (attributes) in datasets of multivariate type with categorical attributes. The selected features (attributes) are then used for classification with several different classifiers, so as to evaluate the performance of the feature selection methods by classification accuracy. The experiment uses six feature selection methods, six datasets, and three classifiers; the result is that the integrated feature selection method combining IG and FCBF achieves the best performance in the following evaluations: the dataset facet, the classifier facet, and the integration of the two facets.
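Information gain, the IG half of the best-performing IG + FCBF combination, is a standard filter criterion: IG(Y; X) = H(Y) - H(Y | X). A small sketch for categorical attributes (FCBF's redundancy analysis is not shown):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy H(Y) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def info_gain(values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one categorical feature."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

def rank_by_ig(rows, labels):
    """Feature indices sorted by decreasing information gain."""
    d = len(rows[0])
    gains = [info_gain([r[j] for r in rows], labels) for j in range(d)]
    return sorted(range(d), key=lambda j: -gains[j])
```

A filter method would keep the top-ranked attributes before handing the reduced dataset to any classifier.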
Styles APA, Harvard, Vancouver, ISO, etc.
43

Lien, Tzu-Chien, et 連子建. « Feature selection methods with hybrid discretization for naive Bayesian classifiers ». Thesis, 2012. http://ndltd.ncl.edu.tw/handle/95105249459952662675.

Texte intégral
Résumé :
Master's thesis
National Cheng Kung University
Institute of Information Management
100
The naïve Bayesian classifier is widely used for classification problems because of its computational efficiency and competitive accuracy. Discretization is one of the major approaches for processing continuous attributes for the naïve Bayesian classifier. Hybrid discretization chooses the discretization method for each continuous attribute individually, and a previous study found that hybrid discretization improves the performance of the naïve Bayesian classifier more than unified discretization. Selective naïve Bayes, abbreviated as SNB, is an important feature selection method for naïve Bayesian classifiers; it improves efficiency and accuracy by removing redundant and irrelevant attributes. The objective of this study is to develop methods that combine hybrid discretization and feature selection, and three methods are proposed for this purpose. Method one, the most efficient, executes hybrid discretization after feature selection. Methods two and three perform hybrid discretization first, followed by feature selection: method two transforms continuous attributes without considering discrete attributes, while method three determines the best discretization method for each continuous attribute by searching all possibilities. The experimental results show that, in general, all three methods with hybrid discretization and feature selection perform better than the method with unified discretization and feature selection, and method three is the best.
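Hybrid discretization, as described, picks a discretization method per attribute. A toy sketch with two simple discretizers and a caller-supplied scoring function (both discretizers and the scoring interface are illustrative stand-ins for the methods actually compared in the thesis):

```python
def equal_width(xs, bins=3):
    """Bin values into intervals of equal width."""
    lo, hi = min(xs), max(xs)
    w = (hi - lo) / bins or 1.0  # guard against a constant column
    return [min(int((x - lo) / w), bins - 1) for x in xs]

def equal_frequency(xs, bins=3):
    """Bin values so each bin holds roughly the same number of points."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0] * len(xs)
    for rank, i in enumerate(order):
        out[i] = min(rank * bins // len(xs), bins - 1)
    return out

def hybrid_discretize(columns, score):
    """Per attribute, keep whichever discretization the scoring
    function (e.g. resulting classifier accuracy) prefers."""
    result = []
    for xs in columns:
        candidates = [equal_width(xs), equal_frequency(xs)]
        result.append(max(candidates, key=score))
    return result
```

In the thesis's method three, `score` would be the accuracy of the naïve Bayesian classifier built from the candidate discretization; any callable over the discretized column works in this sketch.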
Styles APA, Harvard, Vancouver, ISO, etc.
44

Chuang, Chun-Hsiang, et 莊鈞翔. « Subspace Selection based Multiple Classifier System for High Dimensional Data Classification ». Thesis, 2009. http://ndltd.ncl.edu.tw/handle/tjqxm6.

Texte intégral
Résumé :
Master's thesis
National Taichung University of Education
Graduate Institute of Educational Measurement and Statistics
97
In a typical supervised classification task, the size of the training data fundamentally affects the generality of a classifier. Given a finite and fixed amount of training data, the classification result may degrade as the number of features (the dimensionality) increases. Many studies have demonstrated that multiple classifier systems, or so-called ensembles, can alleviate small-sample-size and high-dimensionality concerns and obtain more outstanding and robust results than single models. One effective approach for generating an ensemble of diverse base classifiers is the use of different feature subsets, as in the random subspace method (RSM). The objective of this research is to develop a novel ensemble technique, named the cluster-based dynamic subspace method (CDSM), to strengthen RSM. This work comprises three phases. First, the relationships between feature vectors are explored by clustering algorithms. Second, two importance distributions are imposed on the process of selecting subspaces; they provide rules for automatically selecting a suitable subspace dimensionality and the component dimensions, respectively. Finally, to utilize the spectral and spatial information contained in hyperspectral image data and to enhance the performance and robustness of CDSM, two nonparametric contextual classifiers based on Markov random fields (MRF) are developed. Experimental results on real data show that the proposed method performs better than other conventional subspace methods, especially when the ensemble size is small.
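The random subspace method that CDSM builds on trains each base classifier on a random subset of the feature dimensions and combines them by majority vote. A minimal sketch (1-NN base learners and uniform subspace sampling are simplifying assumptions; CDSM itself draws subspaces from cluster-derived importance distributions):

```python
import random
from collections import Counter

def rsm_views(d, n_views=7, dim=2, seed=0):
    """Random subspace method: draw one random feature subset
    (a 'view') per base learner."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(d), dim)) for _ in range(n_views)]

def rsm_predict(views, rows, labels, query):
    """Majority vote over per-subspace 1-NN decisions."""
    votes = []
    for view in views:
        q = [query[j] for j in view]
        dists = [sum((a - b) ** 2
                     for a, b in zip(q, [r[j] for j in view]))
                 for r in rows]
        votes.append(labels[dists.index(min(dists))])
    return Counter(votes).most_common(1)[0][0]
```

Because each base learner sees only `dim` of the `d` dimensions, the ensemble stays diverse even when training data is scarce relative to the dimensionality.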
Styles APA, Harvard, Vancouver, ISO, etc.
45

Yu-Lu, Jou, et 周育祿. « An Efficient Fuzzy Classifier with Feature Selection Based on Fuzzy Entropy ». Thesis, 1998. http://ndltd.ncl.edu.tw/handle/93257518701023507869.

Texte intégral
Résumé :
Master's thesis
National Taiwan University of Science and Technology
Graduate Institute of Electronic Engineering
86
This thesis presents an efficient fuzzy classifier with the ability to select features based on a fuzzy entropy measure. The fuzzy entropy is employed to evaluate the information of the pattern distribution in the pattern space, and with this information we partition the pattern space into non-overlapping decision regions for pattern classification. Since the decision regions do not overlap, the complexity and computational load of the classifier are reduced, and thus the training and classification times are extremely fast. Although the decision regions are partitioned into non-overlapping subspaces, good performance is still achieved through the resulting smooth boundaries, since the decision regions are fuzzy subspaces. In addition, we investigate a fuzzy entropy-based method to select the relevant features. The feature selection procedure not only reduces the dimensionality of a problem but also discards noise-corrupted, redundant or unimportant features. As a result, the computation time of the classifier is reduced while the classification performance is increased. Finally, we apply the proposed classifier to the Iris database and the Wisconsin breast cancer database to evaluate its classification performance. Both results show that the proposed classifier works well for pattern classification applications.
Styles APA, Harvard, Vancouver, ISO, etc.
46

Makrehchi, Masoud. « Feature Ranking for Text Classifiers ». Thesis, 2007. http://hdl.handle.net/10012/3250.

Texte intégral
Résumé :
Feature selection based on feature ranking has received much attention from researchers in the field of text classification. The major reasons are its scalability, ease of use, and fast computation. However, compared to search-based feature selection methods such as wrappers and filters, feature ranking suffers from poor performance. This is linked to its major deficiencies: (i) feature ranking is problem-dependent; (ii) it ignores term dependencies, including redundancies and correlation; and (iii) it usually fails on unbalanced data. While using feature ranking methods for dimensionality reduction, we should be aware of these drawbacks, which arise from the way feature ranking methods work. In this thesis, a set of solutions is proposed to handle the drawbacks of feature ranking and boost its performance. First, an evaluation framework called feature meta-ranking is proposed to evaluate ranking measures. The framework is based on a newly proposed Differential Filter Level Performance (DFLP) measure. It was proved that, in ideal cases, the performance of a text classifier is a monotonic, non-decreasing function of the number of features. We then theoretically and empirically validate the effectiveness of DFLP as a meta-ranking measure to evaluate and compare feature ranking methods. The meta-ranking framework is also examined on a stopword extraction problem: we use the framework to select an appropriate feature ranking measure for building domain-specific stoplists. The proposed framework is evaluated with SVM and Rocchio text classifiers on six benchmark data sets. The meta-ranking method suggests that, in searching for a proper feature ranking measure, backward feature ranking is as important as forward feature ranking. Second, we show that the destructive effect of term redundancy gets worse as we decrease the feature ranking threshold.
This implies that for aggressive feature selection, effective redundancy reduction should be performed alongside feature ranking. An algorithm based on extracting term dependency links using an information-theoretic inclusion index is proposed to detect and handle term dependencies. The dependency links are visualized by a tree structure called a term dependency tree. By grouping the nodes of the tree into two categories, hub and link nodes, a heuristic algorithm is proposed to handle term dependencies by merging or removing the link nodes. The proposed redundancy reduction method is evaluated with SVM and Rocchio classifiers on four benchmark data sets. According to the results, redundancy reduction is more effective on weak classifiers, since they are more sensitive to term redundancy. The results also suggest that for feature ranking methods that compact the information into a small number of features, aggressive feature selection is not recommended. Finally, to deal with class imbalance at the feature level using ranking methods, a local feature ranking scheme called the reverse discrimination approach is proposed. The proposed method is applied to a highly unbalanced social network discovery problem. In this case study, the problem of learning a social network is translated into a text classification problem using newly proposed actor and relationship modeling. Since social networks are usually sparse structures, the corresponding text classifiers become highly unbalanced. Experimental assessment of the reverse discrimination approach validates the effectiveness of the local feature ranking method in improving classifier performance on unbalanced data. The application itself suggests a new approach to learning social structures from textual data.
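The information-theoretic inclusion index behind the term dependency tree can be illustrated with a small sketch. The toy corpus, the 0.9 threshold, and all variable names below are illustrative assumptions, not the thesis's actual algorithm: a term whose documents are (almost) contained in another term's documents becomes a removable link node, while broadly distributed terms survive as hubs.

```python
from itertools import permutations

# Toy corpus: each document is its set of terms (illustrative data only).
docs = [
    {"neural", "network", "deep"},
    {"neural", "network", "training"},
    {"network", "protocol"},
    {"neural", "network", "deep", "training"},
]

def doc_freq(term):
    """Number of documents containing the term."""
    return sum(term in d for d in docs)

def inclusion(t1, t2):
    """Inclusion index inc(t1 -> t2): fraction of t1's documents
    that also contain t2."""
    both = sum(t1 in d and t2 in d for d in docs)
    return both / doc_freq(t1)

THRESHOLD = 0.9  # assumed cut-off for drawing a dependency link

vocab = sorted(set().union(*docs))
links = [(t1, t2) for t1, t2 in permutations(vocab, 2)
         if inclusion(t1, t2) >= THRESHOLD]

# A term with an outgoing link is a "link node" (its occurrences are
# contained in another term's) and can be merged or removed; a term
# with only incoming links acts as a hub.
link_nodes = {t1 for t1, _ in links}
hubs = [t for t in vocab if t not in link_nodes]
print(hubs)  # only the broadly distributed term "network" survives as a hub
```

On this toy corpus every specialized term ("deep", "neural", "training", "protocol") links into "network", so aggressive redundancy reduction would keep only the hub.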
Styles APA, Harvard, Vancouver, ISO, etc.
47

Freeman, Cecille. « Feature selection and hierarchical classifier design with applications to human motion recognition ». Thesis, 2014. http://hdl.handle.net/10012/8480.

Texte intégral
Résumé :
The performance of a classifier is affected by a number of factors, including the classifier type, the input features and the desired output. This thesis examines the impact of feature selection and classification problem division on classification accuracy and complexity. Proper feature selection can reduce classifier size and improve classifier performance by minimizing the impact of noisy, redundant and correlated features. Noisy features can cause false associations between the features and the classifier output. Redundant and correlated features increase classifier complexity without adding information. Output selection, or classification problem division, describes the division of a large classification problem into a set of smaller problems. Problem division can improve accuracy by allocating more resources to more difficult class divisions and enabling the use of more specific feature sets for each sub-problem. The first part of this thesis presents two methods for creating feature-selected hierarchical classifiers. The feature-selected hierarchical classification method jointly optimizes the features and the classification tree design using genetic algorithms. The multi-modal binary tree (MBT) method performs the class division and feature selection sequentially and tolerates misclassifications in the higher nodes of the tree. This yields a piecewise separation for classes that cannot be fully separated with a single classifier. Experiments show that the accuracy of MBT is comparable to other multi-class extensions, but with lower test time. Furthermore, the accuracy of MBT is significantly higher on multi-modal data sets. The second part of this thesis focuses on input feature selection measures. A number of filter-based feature subset evaluation measures are compared, with the goal of assessing their performance with respect to specific classifiers.
Although many feature selection measures have been proposed in the literature, it is unclear which are appropriate for use with different classifiers. Sixteen common filter-based measures are tested on 20 real and 20 artificial data sets, the latter designed to probe for specific feature selection challenges. The strengths and weaknesses of each measure are discussed with respect to the specific feature selection challenges in the artificial data sets, correlation with classifier accuracy, and the ability to identify known informative features. The results indicate that the best filter measure is classifier-specific: k-nearest-neighbour classifiers work well with subset-based RELIEF, correlation feature selection or conditional mutual information maximization, whereas Fisher's interclass separability criterion and conditional mutual information maximization work better for support vector machines. Based on the results of the feature selection experiments, two new filter-based measures are proposed, building on conditional mutual information maximization, which performs well but cannot identify dependent features in a set and does not check for correlated features. Both new measures explicitly check for dependent features, and the second measure also includes a term to discount correlated features. Both measures correctly identify known informative features in the artificial data sets and correlate well with classifier accuracy. The final part of this thesis examines feature selection for time-series data, using feature selection to determine important individual time windows, or key frames, in the series. Time-series feature selection is combined with the MBT algorithm to create classification trees for time-series data. The feature-selected MBT algorithm is tested on two human motion recognition tasks: full-body human motion recognition from joint angle data and hand gesture recognition from electromyography data. Results indicate that the feature-selected MBT achieves high classification accuracy on the time-series data while maintaining a short test time.
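The conditional mutual information maximization (CMIM) criterion mentioned above can be sketched as a greedy loop: each step adds the feature whose worst-case conditional relevance, min over already-selected features X_j of I(X; Y | X_j), is largest. The plain-Python estimators and the toy data below are illustrative assumptions, not the thesis's implementation; the point is that an exact duplicate of an already-chosen feature scores zero and is skipped.

```python
from math import log2
from collections import Counter

def entropy(xs):
    """Shannon entropy of a discrete sequence, in bits."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def cond_entropy(xs, ys):
    """H(X | Y) for paired discrete sequences."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def mi(xs, ys):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(xs) - cond_entropy(xs, ys)

def cond_mi(xs, ys, zs):
    """I(X; Y | Z) = sum over z of p(z) * I(X; Y | Z = z)."""
    n = len(zs)
    groups = {}
    for x, y, z in zip(xs, ys, zs):
        gx, gy = groups.setdefault(z, ([], []))
        gx.append(x)
        gy.append(y)
    return sum(len(gx) / n * mi(gx, gy) for gx, gy in groups.values())

def cmim(features, labels, k):
    """Greedy CMIM: repeatedly add the feature maximizing
    min_j I(X; Y | X_j) over the features X_j chosen so far."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return mi(features[name], labels)
            return min(cond_mi(features[name], labels, features[s])
                       for s in selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: f2 duplicates f1; f3 resolves the cases f1 gets wrong.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 0],
    "f2": [0, 0, 0, 0, 1, 1, 1, 0],  # exact copy of f1
    "f3": [0, 0, 0, 0, 0, 1, 1, 1],
}
print(cmim(features, labels, 2))  # → ['f1', 'f3'], skipping the duplicate f2
```

This zero-score behaviour for duplicates is exactly the property the thesis's two new measures extend with explicit dependency and correlation checks.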
Styles APA, Harvard, Vancouver, ISO, etc.
48

Feng, Kuan-Jen, et 馮冠仁. « An Efficient Hierarchical Metadata Classifier based on SVM and Feature Selection Methods ». Thesis, 2006. http://ndltd.ncl.edu.tw/handle/71704686561687980607.

Texte intégral
Résumé :
Master's thesis
National Chi Nan University
Department of Computer Science and Information Engineering
Academic year 94 (2005–06)
Constructing a Web portal by integrating content from various information systems is crucial for providing public, popular and friendly services. In this thesis, we propose a hierarchical classifier system for fusing heterogeneous categories from various information systems. Employing traditional text classification methods, which classify documents into predefined categories, is one possible solution. However, traditional methods suffer from two drawbacks: huge numbers of text features, and flat classification that ignores hierarchical structures. Feature selection methods tend to select features from large classes, so classification performance on small classes is poor. Flat classification treats hierarchical classes as flat-structured classes; each category then corresponds to a single classifier that tends to select features distinguishing that class from all the rest. Discriminative features are therefore hard to select effectively, since the hierarchical knowledge is not applied to enhance the classification task. To deal with these problems, we propose feature selection methods that prevent the process from being dominated by large classes. Based on the SVM classification method, we propose a hierarchical classification method that supports classification of hierarchical portal objects with metadata. We also employ domain concept hierarchies as background knowledge to improve the feature selection and classification processes using the portal's hierarchical knowledge. The NMNS portal is used as the test bed. Experiments show that our hierarchical classifier, with an outstanding 98.5% F-measure, is more efficient than a traditional flat classifier.
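One simple way to keep feature selection from being dominated by large classes is to pick features round-robin from each class's own ranking rather than from a single global ranking. This is a sketch under assumed per-class term scores, not necessarily the method of this thesis:

```python
def round_robin_select(class_scores, k):
    """Pick k features by cycling through the classes and taking each
    class's next best-scoring unused term, so small classes still
    contribute features instead of being crowded out."""
    ranked = {c: sorted(scores, key=scores.get, reverse=True)
              for c, scores in class_scores.items()}
    selected = []
    while len(selected) < k and any(ranked.values()):
        for terms in ranked.values():
            while terms:
                term = terms.pop(0)
                if term not in selected:
                    selected.append(term)
                    break
            if len(selected) == k:
                break
    return selected

# Assumed per-class relevance scores (e.g. chi-square values); the
# class sizes and terms are invented for illustration.
class_scores = {
    "large_class": {"economy": 0.9, "market": 0.8, "trade": 0.7},
    "small_class": {"fossil": 0.6, "dinosaur": 0.5},
}
print(round_robin_select(class_scores, 4))
```

A global top-4 by raw score would take three terms from the large class and one from the small one; the round-robin pass yields two from each.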
Styles APA, Harvard, Vancouver, ISO, etc.
49

Nimbalkar, Prakash. « Optimal HySpex band selection for roof classification determined from supervised classifier efficiency ». Doctoral thesis, 2022. https://depotuw.ceon.pl/handle/item/4155.

Texte intégral
Résumé :
The urban ecosystem is characterized by complex structures and vast heterogeneity in surface materials. Urban surfaces have an impact on their surrounding environment, which necessitates urban mapping. Urban areas are complex and call for a robust supervised classification method. Hyperspectral remote sensing has emerged as a potential tool for quantifying urban surfaces and delivering an accurate, up-to-date inventory. However, hyperspectral imaging (HSI) suffers from high dimensionality, which entails larger storage, data noise, redundancy and higher computing requirements. Therefore, this study aimed to derive optimal bands for the characterization of urban surfaces in the city of Białystok. Band optimization was performed using data reduction methods and classifier efficiency. The algorithm was tested in two experimental setups: first, characterization of roof surfaces; second, characterization of an entire urban area. The study used LiDAR data and an airborne HySpex image. It proposes a band selection method called Principal Component Analysis–Band Selection (PCA–BS), compared against the transformation methods PCA and Minimum Noise Fraction (MNF). The study employed and compared an Artificial Neural Network (ANN), Support Vector Machine (SVM), Spectral Angle Mapper (SAM) and Spectral Information Divergence (SID) for the characterization of urban surfaces. The results of both experiments confirmed that the PCA–BS band selection method offers better results than the transformation methods (PCA and MNF). Initial band optimization using 30 PCA–BS bands over the urban area delivered the highest Overall Accuracies (OAs): 94.34% for ANN and 88.72% for SVM. In roof classification, ANN achieved an OA of 90.85% and a Kappa of 0.90. Overall, the ANN and SVM classifiers scored the best results, whereas SAM and SID performed poorly. More precisely, the PCA–BS method and the classifiers allowed the extraction of 10 optimal bands capable of reaching accuracies of 83.2% (SVM) and 86.63% (ANN) over the urban area. Using LiDAR with HySpex increased the discrimination potential and raised the overall accuracies by 6%.
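The exact PCA–BS procedure is not detailed in this abstract; a minimal sketch of one common PCA-based band selection variant, scoring bands by their absolute loading on the first principal component and keeping the top k, is shown below. The synthetic four-band pixels, the power-iteration eigensolver and all parameter choices are assumptions for illustration only.

```python
import random

def first_pc(data):
    """Leading eigenvector of the covariance matrix via power iteration."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(200):  # repeated multiplication converges to the top eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def select_bands(data, k):
    """Keep the k band indices with the largest absolute loading
    on the first principal component."""
    v = first_pc(data)
    return sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]

# Synthetic pixels with four "bands": bands 0 and 1 carry the signal,
# bands 2 and 3 are near-constant noise (invented data).
random.seed(0)
pixels = []
for _ in range(50):
    s = random.random()
    pixels.append([3 * s, 2 * s + 0.01 * random.random(),
                   0.05 * random.random(), 0.05 * random.random()])

print(select_bands(pixels, 2))  # → [0, 1], the two high-variance bands
```

Selecting original bands (rather than keeping transformed PCA components, as plain PCA or MNF would) preserves the physical interpretability of the retained wavelengths, which is the usual motivation for band selection over transformation.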
Styles APA, Harvard, Vancouver, ISO, etc.
50

Cheng, Shao-wei, et 鄭紹偉. « Ensemble classifier with feature selection and multi-words for disease code assignment ». Thesis, 2009. http://ndltd.ncl.edu.tw/handle/85980912556252350164.

Texte intégral
Résumé :
Master's thesis
National Yunlin University of Science and Technology
Master's Program, Department of Information Management
Academic year 97 (2008–09)
After the National Health Insurance (NHI) program was implemented, the Health Insurance Bureau required hospitals to report medical records with ICD-9-CM codes when applying for reimbursement of medical expenses. Hospitals that do not conform to this rule are not subsidized; in particular, incorrect or omitted codes lead to deleted or deducted reimbursements. To determine the correct ICD-9-CM codes for a discharge summary, medical staff have to check each document manually, a labor-intensive task that wastes human resources and time. Prior research has studied the use of domain knowledge to extend the concepts of document terms; however, those terms are limited to single words and exclude multi-word terms. In addition, codes in subcategories under the same category have similar meanings, and the data are imbalanced. This study focuses on keyword selection and multi-word term expansion, combined with ensemble techniques, to enhance the performance of SVM and Bayes classifiers in assigning disease codes to medical documents. The experimental results show that the chi-square method selects higher-quality keywords, that multi-word and extended-word terms capture more of the information in medical documents, and that the AdaBoost ensemble method improves the classification performance of the Bayes classifier.
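The chi-square keyword selection step can be illustrated with the standard 2×2 term/class contingency formulation; the counts and the terms below are invented for illustration and are not from the study.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from the 2x2 table:
    a = class docs containing the term, b = other docs containing it,
    c = class docs without it,          d = other docs without it."""
    n = a + b + c + d
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / den if den else 0.0

# Invented counts over 100 discharge summaries, 40 in a hypothetical
# class: "insulin" is concentrated in the class, "patient" appears
# at the same rate everywhere.
scores = {
    "insulin": chi_square(30, 5, 10, 55),
    "patient": chi_square(38, 57, 2, 3),
}
keywords = sorted(scores, key=scores.get, reverse=True)
print(keywords)  # → ['insulin', 'patient']
```

Terms whose distribution matches the class base rate score zero ("patient" here), while class-concentrated terms score highly, which is why chi-square tends to surface discriminative medical keywords.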
Styles APA, Harvard, Vancouver, ISO, etc.
