Dissertations / Theses: 'Data Classification'

1

Morshedzadeh, Iman. "Data Classification in Product Data Management." Thesis, Högskolan i Skövde, Institutionen för teknik och samhälle, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14651.


Abstract:

This report presents a product data classification methodology that is usable for the production data of the Volvo Cars Engine (VCE) factory and can be implemented in the Teamcenter software. A great deal of data is generated during the life cycle of each product, and companies try to manage these data with product data management software. Data classification is the part of data management that makes the use of data more effective and efficient. Surveys carried out in this project identified the items that affect data classification: the data, the attributes, the classification method, the Volvo Cars Engine factory, and Teamcenter as the product data management software. Each of these items is explained separately in this report. Based on the knowledge obtained about these items, a suitable hierarchical classification method for the Volvo Cars Engine factory is described. In the last part of the report, this method is implemented in the software to show that it is executable.


2

Currie, Sheila. "Data classification for choropleth mapping." Thesis, University of Ottawa (Canada), 1989. http://hdl.handle.net/10393/5725.


3

Gómez, Juan Martínez. "Automatic classification of neural data." Thesis, University of Leicester, 2011. http://hdl.handle.net/2381/9696.


Abstract:

In this thesis we present a new solution for the automatic classification of single-neuron activity. The study of the computational role of individual neurons underlying different cognitive processes is a gold standard in neuroscience. This type of analysis is done, first, by recording the extracellular spikes of the neurons near the tip of a microelectrode and, second, by isolating the spikes of the recorded cells based on the similarity of their shapes, using a method called spike sorting. In recent years, important advances in microelectrode technology have made it possible to perform massively parallel recordings over a large number of channels and thus to study the activity of large ensembles of neurons at a time. However, this fascinating opportunity also poses a challenge for the fast and efficient analysis of the data. In this research work, we address this problem by developing a new implementation of unsupervised spike sorting that improves the performance of a widely used spike sorting algorithm, increasing the number of automatically identified neurons. Moreover, we developed a new testing platform that generates simulations of extracellular recordings, including challenging conditions such as realistic noise, multi-unit activity (spikes of distant neurons that cannot be identified as single units) and the presence of neurons with low firing rates. In summary, the results presented here contribute to the development of automated and efficient quantitative frameworks for the analysis of multiple-channel recordings that help us to understand single-neuron population codes.


4

Pötzelberger, Klaus, and Helmut Strasser. "Data Compression by Unsupervised Classification." Department of Statistics and Mathematics, WU Vienna University of Economics and Business, 1997. http://epub.wu.ac.at/974/1/document.pdf.


Abstract:

This paper deals with a general class of classification methods which are related both to vector quantization in the sense of Pollard [12] and to competitive learning in the sense of Kohonen [10]. The basic duality of minimum variance partitioning and vector quantization known from statistical cluster analysis is shown to hold for this whole class of classification problems. The paper contains theoretical results such as the existence of optima, the consistency of approximate optima and the characterization of local optima as fixed points of a fixed-point algorithm. A fixed-point algorithm is proposed and its termination after finite time is proved for empirical distributions. The construction of a particular classification method is based on a statistical information measure specified by a convex function. Modifying this convex function gives room for suggesting a large variety of new classification procedures, e.g. of robust quantifiers. (author's abstract)
Series: Forschungsberichte / Institut für Statistik


5

Soukhoroukova, Nadejda. "Data classification through nonsmooth optimization." Thesis, University of Ballarat [Mt. Helen, Vic.] :, 2003. http://researchonline.federation.edu.au/vital/access/HandleResolver/1959.17/42220.


6

Kröger, Viktor. "Classification in Functional Data Analysis : Applications on Motion Data." Thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-184963.


Abstract:

Anterior cruciate knee ligament injuries are common and well known, especially amongst athletes. These injuries often require surgery and long rehabilitation programs, and can lead to loss of function and re-injuries (Marshall et al., 1977). This work aims to explore the possibility of applying supervised classification to knee functionality, using different types of models and testing different divisions of classes. The data used is gathered through a performance test, where individuals perform one-leg hops with motion sensors attached to their bodies. The obtained data represents position over time, and is considered functional data. With functional data analysis (FDA), a process can be analysed as a continuous function of time, instead of being reduced to finite data points. FDA includes many useful tools, but also some challenges. A functional observation can, for example, be differentiated, a handy tool not found in the multivariate tool-box. The speed and acceleration can then be calculated from the obtained data. How to define "similarity" is, on the other hand, not as obvious as with points. In this work, an FDA approach is taken to classifying knee kinematic data from a long-term follow-up study on knee ligament injuries. This work studies kernel functional classifiers and k-nearest neighbours models, and performs significance tests on the model accuracy using re-sampling methods. Additionally, depending on how similarity is defined, the models can distinguish different features of the data. Attempts at utilising more information by incorporating ensemble methods do not outperform the single models the ensembles are created from. Further, it is shown that classification on optimised sub-domains can be superior, in terms of predictive power, to classifiers using the full domain.


7

Lan, Liang. "Data Mining Algorithms for Classification of Complex Biomedical Data." Diss., Temple University Libraries, 2012. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/214773.


Abstract:

Computer and Information Science
Ph.D.
In my dissertation, I will present my research, which contributes to solving the following three open problems from biomedical informatics: (1) Multi-task approaches for microarray classification; (2) Multi-label classification of gene and protein prediction from multi-source biological data; (3) Spatial scan for movement data. In microarray classification, samples belong to several predefined categories (e.g., cancer vs. control tissues) and the goal is to build a predictor that classifies a new tissue sample based on its microarray measurements. When faced with small-sample, high-dimensional microarray data, most machine learning algorithms would produce an overly complicated model that performs well on training data but poorly on new data. To reduce the risk of over-fitting, feature selection becomes an essential technique in microarray classification. However, standard feature selection algorithms are bound to underperform when the size of the microarray data is particularly small. The best remedy is to borrow strength from external microarray datasets. In this dissertation, I will present two new multi-task feature filter methods which can improve the classification performance by utilizing the external microarray data. The first method aggregates the feature selection results from multiple microarray classification tasks. The resulting multi-task feature selection can be shown to improve the quality of the selected features and lead to higher classification accuracy. The second method jointly selects a small gene set with maximal discriminative power and minimal redundancy across multiple classification tasks by solving an objective function with integer constraints. In the protein function prediction problem, gene functions are predicted from a predefined set of possible functions (e.g., the functions defined in the Gene Ontology).
Gene function prediction is a complex classification problem characterized by the following aspects: (1) a single gene may have multiple functions; (2) the functions are organized in a hierarchy; (3) the training data for each function are unbalanced (far fewer positive than negative examples); (4) class labels are missing; (5) multiple biological data sources are available, such as microarray data, genome sequence and protein-protein interactions. As participants in the 2011 Critical Assessment of Function Annotation (CAFA) challenge, our team achieved the highest AUC accuracy among 45 groups. In the competition, we gained by focusing on the fifth aspect of the problem. Thus, in this dissertation, I will discuss several schemes to integrate the prediction scores from multiple data sources and show their results. Interestingly, the experimental results show that a simple averaging integration method is competitive with other state-of-the-art data integration methods. The original spatial scan algorithm is used for the detection of spatial overdensities: the discovery of spatial subregions with significantly higher scores according to some density measure. This algorithm is widely used in identifying clusters of disease cases (e.g., identifying environmental risk factors for child leukemia). However, the original spatial scan algorithm only works on static spatial data. In this dissertation, I will propose one possible solution for spatial scan on movement data.
Temple University--Theses
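
The "simple averaging integration" mentioned in the abstract above can be illustrated with a minimal sketch. The source names, gene names and scores below are invented for illustration, and the function `average_scores` is a hypothetical helper, not the dissertation's implementation; it only assumes that each data source yields one score per gene and that a gene may be missing from some sources.

```python
from statistics import mean

def average_scores(scores_by_source):
    """Average each item's prediction score over the sources that scored it."""
    items = set()
    for scores in scores_by_source.values():
        items.update(scores)
    return {
        item: mean(s[item] for s in scores_by_source.values() if item in s)
        for item in sorted(items)
    }

# Hypothetical per-source prediction scores for one candidate function.
scores_by_source = {
    "microarray": {"geneA": 0.9, "geneB": 0.2},
    "sequence": {"geneA": 0.7, "geneB": 0.4},
    "ppi": {"geneA": 0.8},  # geneB was not scored by this source
}
combined = average_scores(scores_by_source)
```

Averaging only over the sources that actually scored a gene is one possible way to handle missing data; other choices (for example, treating a missing score as zero) would give different rankings.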


8

Lozano, Albalate Maria Teresa. "Data Reduction Techniques in Classification Processes." Doctoral thesis, Universitat Jaume I, 2007. http://hdl.handle.net/10803/10479.


Abstract:

The learning process consists of several steps: building a Training Set (TS), training the system, testing its behaviour and finally classifying unknown objects. When a distance-based rule such as 1-Nearest Neighbour (1-NN) is used as the classifier, the first step (building a training set) includes editing and condensing the data. The main reason is that distance-based rules need a long time to classify each unlabelled sample, x, because the distance from x to every point in the training set must be calculated. So, the smaller the training set, the shorter the time needed for each new classification. This thesis focuses mainly on building a training set from already given data, and especially on condensing it; however, different classification techniques are also compared.
The aim of any condensing technique is to obtain a reduced training set so that as little time as possible is spent on classification, without a significant loss in classification accuracy. Some new approaches to training set size reduction based on prototypes are presented. These schemes basically consist of defining a small number of prototypes that represent all the original instances. They include approaches that select among the already existing examples (selective condensing algorithms) and approaches that generate new representatives (adaptive condensing algorithms).
These new reduction techniques are experimentally compared to some traditional ones, for data represented in feature spaces. To test them, the classical 1-NN rule is applied. However, other (fast) classifiers have also been considered, such as linear and quadratic classifiers constructed in dissimilarity spaces based on prototypes, in order to see how the editing and condensing concepts work for this different family of classifiers.
Although the goal of the algorithms proposed in this thesis is to obtain a strongly reduced set of representatives, their performance is empirically evaluated over eleven real data sets by comparing not only the reduction rate but also the classification accuracy with those of other condensing techniques. Therefore, the ultimate aim is to find not only a strongly reduced set, but also a balanced one.
The same problem can be solved in several ways. When a distance-based rule is used as the classifier, reducing the training set is not the only option. A different family of approaches consists of applying efficient search methods. Therefore, the results obtained with the algorithms presented here are compared, in terms of classification accuracy and time, to several efficient search techniques.
Finally, the main contributions of this PhD report can be briefly summarised in four principal points. Firstly, two selective algorithms based on the idea of surrounding neighbourhood are proposed. They obtain better results than the other algorithms presented here, as well as better results than other traditional schemes. Secondly, a generative approach based on mixtures of Gaussians is proposed. It gives better results in classification accuracy and size reduction than traditional adaptive algorithms, and results similar to those of LVQ. Thirdly, it is shown that classification rules other than 1-NN can be used, even leading to better results. And finally, it is deduced from the experiments carried out that, with some databases (such as the ones used here), the approaches presented here execute the classification processes in less time than the efficient search techniques.
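
To make the idea of selective condensing concrete, the following is a minimal sketch of a classical Hart-style condensed nearest neighbour procedure. It is not one of the algorithms proposed in the thesis; it is a traditional scheme swapped in for illustration, and the toy data and the helper names `nearest_label` and `condense` are assumptions.

```python
import math

def nearest_label(x, prototypes):
    """Return the label of the prototype closest to x (the 1-NN rule)."""
    return min(prototypes, key=lambda p: math.dist(x, p[0]))[1]

def condense(training_set):
    """Hart-style condensing: keep only the samples that the current
    subset misclassifies, repeating until a full pass adds nothing."""
    kept = [training_set[0]]
    changed = True
    while changed:
        changed = False
        for point, label in training_set:
            if nearest_label(point, kept) != label:
                kept.append((point, label))
                changed = True
    return kept

# Toy two-class training set (invented for illustration).
training_set = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"),
]
condensed = condense(training_set)
```

On this toy data the condensed set keeps one sample per class, and the 1-NN rule over the condensed set still classifies every original training sample correctly, which is the consistency property this family of selective methods aims for.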


9

Aygar, Alper. "Doppler Radar Data Processing And Classification." Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/12609890/index.pdf.


Abstract:

In this thesis, improving the performance of the automatic recognition of Doppler radar targets is studied. The radar used in this study is a ground-surveillance Doppler radar. The target types are car, truck, bus, tank, helicopter, moving man and running man. The input of this thesis consists of real Doppler radar signals that were normalized and preprocessed into TRP (Target Recognition Pattern) vectors in the doctoral thesis by Erdogan (2002). TRP vectors are Doppler radar target signals that have been normalized and homogenized with respect to target speed, target aspect angle and target range. Some target classes have repetitions in time in their TRPs. The use of these repetitions to improve target type classification performance is studied. K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) algorithms are used for Doppler radar target classification and the results are evaluated. Before classification, PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), NMF (Nonnegative Matrix Factorization) and ICA (Independent Component Analysis) are implemented and applied to the normalized Doppler radar signals for efficient feature extraction and dimension reduction. These techniques transform the input vectors, which are the normalized Doppler radar signals, to another space. The effects of these feature extraction algorithms and of the use of the repetitions in Doppler radar target signals on the target classification performance are studied.


10

Lee, Ho-Jin. "Functional data analysis: classification and regression." Texas A&M University, 2004. http://hdl.handle.net/1969.1/2805.


Abstract:

Functional data refer to data which consist of observed functions or curves evaluated at a finite subset of some interval. In this dissertation, we discuss statistical analysis, especially classification and regression, when data are available in functional form. Due to the nature of functional data, one considers function spaces in representing this type of data, and each functional observation is viewed as a realization generated by a random mechanism in those spaces. The classification procedure in this dissertation is based on dimension reduction techniques for these spaces. One commonly used method is Functional Principal Component Analysis (Functional PCA), in which the eigendecomposition of the covariance function is employed to find the directions of highest variability of the data in the function space. The reduced space of functions spanned by a few eigenfunctions is thought of as a space where most of the features of the functional data are contained. We also propose a functional regression model for scalar responses. The infinite dimensionality of the predictor space causes many problems, one of which is that there are infinitely many solutions. The space of the parameter function is restricted to Sobolev-Hilbert spaces, and the so-called ε-insensitive loss function is utilized. As a robust technique of function estimation, we present a way to find a function that deviates by at most ε from the observed values and at the same time is as smooth as possible.
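
For reference, the ε-insensitive loss mentioned in the abstract above is usually written as follows. This is the standard form known from support vector regression; the notation (response y, fitted value f(x), tolerance ε) is an assumption and may differ from the dissertation's own.

```latex
\[
L_\varepsilon\bigl(y, f(x)\bigr) \;=\; \max\bigl(0,\; |y - f(x)| - \varepsilon\bigr)
\]
```

Deviations smaller than ε cost nothing, and larger deviations are penalized only linearly, which is what makes the resulting estimate robust.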


11

Fernandez, Noemi. "Statistical information processing for data classification." FIU Digital Commons, 1996. http://digitalcommons.fiu.edu/etd/3297.


Abstract:

This thesis introduces new algorithms for analysis and classification of multivariate data. Statistical approaches are devised for the objectives of data clustering, data classification and object recognition. An initial investigation begins with the application of fundamental pattern recognition principles. Where such fundamental principles meet their limitations, statistical and neural algorithms are integrated to augment the overall approach for an enhanced solution. This thesis provides a new dimension to the problem of classification of data as a result of the following developments: (1) application of algorithms for object classification and recognition; (2) integration of a neural network algorithm which determines the decision functions associated with the task of classification; (3) determination and use of the eigensystem using newly developed methods with the objectives of achieving optimized data clustering and data classification, and dynamic monitoring of time-varying data; and (4) use of the principal component transform to exploit the eigensystem in order to perform the important tasks of orientation-independent object recognition, and dimensionality reduction of the data such as to optimize the processing time without compromising accuracy in the analysis of this data.


12

Bocancea, Andreea. "Supervised Classification Leveraging Refined Unlabeled Data." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-119320.


Abstract:

This thesis focuses on how unlabeled data can improve supervised learning classifiers in all contexts, from scarce to abundant label situations. This is meant to address the limitations of supervised learning with regard to label availability. Extending the training set with unlabeled data can overcome issues such as selection bias, noise and insufficient data. Based on the overall data distribution and the initial set of labels, semi-supervised methods provide labels for additional data points. The semi-supervised approaches considered in this thesis belong to one of the following categories: transductive SVMs, Cluster-then-Label and graph-based techniques. Further, we evaluate the behavior of logistic regression, the single-layer perceptron, SVMs and decision trees. By learning on the extended training set, supervised classifiers are able to generalize better. Based on the results, this thesis recommends data-processing and algorithmic solutions appropriate to real-world situations.


13

Langdon, Matthew James. "Classification of images and censored data." Thesis, University of Leeds, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.434618.


14

NUNES, BERNARDO PEREIRA. "AUTOMATIC CLASSIFICATION OF SEMI-STRUCTURED DATA." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2009. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=14382@1.


Abstract:

PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO
The problem of data classification goes back to the definition of taxonomies covering knowledge areas. With the advent of the Web, the amount of data available has increased several orders of magnitude, making manual data classification impossible. This dissertation proposes a method to automatically classify semi-structured data, represented by frames, without any previous knowledge about structured classes. The dissertation introduces an algorithm, based on K-Medoid, capable of organizing a set of frames into classes, structured as a strict hierarchy. The classification of the frames is based on a closeness criterion that takes into account the attributes and their values in each frame.
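
A closeness criterion over attributes and values, as described in the abstract above, might be sketched as follows. The abstract does not specify the criterion actually used, so the Jaccard distance over attribute-value pairs below is an assumption chosen for illustration, and the frames, `frame_distance` and `medoid` are hypothetical.

```python
def frame_distance(frame_a, frame_b):
    """Jaccard distance between two frames, each given as a dict that
    maps attribute names to (hashable) values."""
    pairs_a = set(frame_a.items())
    pairs_b = set(frame_b.items())
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return 1.0 - len(pairs_a & pairs_b) / len(union)

def medoid(frames):
    """Return the frame with the smallest total distance to all others,
    i.e. the representative a K-Medoid step would pick for one class."""
    return min(frames, key=lambda f: sum(frame_distance(f, g) for g in frames))

# Hypothetical frames describing three documents.
book1 = {"type": "book", "lang": "en", "format": "pdf"}
book2 = {"type": "book", "lang": "en", "format": "print"}
book3 = {"type": "book", "lang": "pt", "format": "pdf"}
representative = medoid([book1, book2, book3])
```

Here `book1` shares two attribute-value pairs with each of the other frames, so it has the smallest total distance and is chosen as the medoid. A full K-Medoid algorithm would alternate between assigning frames to their nearest medoid and recomputing medoids in this way.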


15

Van, der Walt Christiaan Maarten. "Data measures that characterise classification problems." Diss., Pretoria : [s.n.], 2008. http://upetd.up.ac.za/thesis/available/etd-08292008-162648/.


16

Palm, Niklas. "Sentiment classification of Swedish Twitter data." Thesis, Uppsala universitet, Avdelningen för datalogi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-388420.


Abstract:

Sentiment analysis is a field within natural language processing that studies the sentiment of human-written text. Within sentiment analysis, sentiment classification is a research area concerned with the classification of the subjective information in text data, and it has been of growing interest since the advent of digital social-media platforms. Many studies have been conducted on sentiment classification, producing numerous openly available tools and resources that further advance research, though almost exclusively for the English language. There are very few openly available Swedish resources that aid research, and sentiment classification research in non-English languages most often uses English resources in one way or another. The lack of non-English resources impedes research in other languages, and there is very little research on sentiment classification using Swedish resources. This thesis addresses the lack of knowledge in this area by designing and implementing a sentiment classifier using Swedish resources, in order to evaluate how methods and best practices commonly used in English research transfer to Swedish. The results in this thesis indicate that Swedish resources can be used in the construction of internationally competitive sentiment classifiers and that methods commonly used in English research for preprocessing text data may not be optimal for the Swedish language.


17

Tziatzios, Achilleas. "Data mining of range-based classification rules for data characterization." Thesis, Cardiff University, 2014. http://orca.cf.ac.uk/65902/.


Abstract:

Advances in data gathering have led to the creation of very large collections across different fields, such as industrial site sensor measurements or the account statuses of a financial institution's clients. The ability to learn classification rules, that is, rules that associate specific attribute values with a specific class label, from this data is important and useful in a range of applications. While many methods to facilitate this task have been proposed, existing work has focused on categorical datasets, and very few solutions that can derive classification rules of associated continuous ranges (numerical intervals) have been developed. Furthermore, these solutions have relied solely on classification performance as a means of evaluation and therefore focus on the mining of mutually exclusive classification rules and the correct prediction of the most dominant class values. As a result, existing solutions demonstrate only limited utility when applied to data characterization tasks. This thesis proposes a method, inspired by classification association rule mining, that derives range-based classification rules from numerical data. The presented method searches for associated numerical ranges that have a class value as their consequent and meet a set of user-defined criteria. A new interestingness measure is proposed for evaluating the density of range-based rules, and four heuristic-based approaches are presented for targeting different sets of rules. Extensive experiments demonstrate the effectiveness of the new algorithm for classification tasks when compared to existing solutions and its utility as a solution for data characterization.


18

Davari, Mahdad. "Advances Towards Data-Race-Free Cache Coherence Through Data Classification." Doctoral thesis, Uppsala universitet, Avdelningen för datorteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-320595.


Abstract:

Providing a consistent view of the shared memory based on precise and well-defined semantics—memory consistency model—has been an enabling factor in the widespread acceptance and commercial success of shared-memory architectures. Moreover, cache coherence protocols have been employed by the hardware to remove from the programmers the burden of dealing with the memory inconsistency that emerges in the presence of the private caches. The principle behind all such cache coherence protocols is to guarantee that consistent values are read from the private caches at all times. In its most stringent form, a cache coherence protocol eagerly enforces two invariants before each data modification: i) no other core has a copy of the data in its private caches, and ii) all other cores know where to receive the consistent data should they need the data later. Nevertheless, by partly transferring the responsibility for maintaining those invariants to the programmers, commercial multicores have adopted weaker memory consistency models, namely the Total Store Order (TSO), in order to optimize the performance for more common cases. Moreover, memory models with more relaxed invariants have been proposed based on the observation that more and more software is written in compliance with the Data-Race-Free (DRF) semantics. The semantics of DRF software can be leveraged by the hardware to infer when data in the private caches might be inconsistent. As a result, hardware ignores the inconsistent data and retrieves the consistent data from the shared memory. DRF semantics therefore removes from the hardware the burden of eagerly enforcing the strong consistency invariants before each data modification. Instead, consistency is guaranteed only when needed. This results in manifold optimizations, such as reducing the energy consumption and improving the performance and scalability. 
The efficiency of detecting and discarding the inconsistent data is an important factor affecting the efficiency of such coherence protocols. For instance, discarding the consistent data does not affect the correctness, but results in performance loss and increased energy consumption. In this thesis we show how data classification can be leveraged as an effective tool to simplify the cache coherence based on the DRF semantics. In particular, we introduce simple but efficient hardware-based private/shared data classification techniques that can be used to efficiently detect the inconsistent data, thus enabling low-overhead and scalable cache coherence solutions based on the DRF semantics.


19

Better, Marco L. "Data mining techniques for prediction and classification in discrete data applications." Connect to online resource, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3273688.


20

Tennant, Mark. "A parallel data stream classification technique for high velocity data streams." Thesis, University of Reading, 2018. http://centaur.reading.ac.uk/77919/.

Abstract:

Real-time classification of data streams remains one of the most challenging aspects of Big Data. As a data stream is an unending source of information, classification models and metrics must be created and adapted in real time as the data is made available to them. This time-constrained learning is problematic because conventional data models require a training period to examine the data and produce models for evaluation. In data stream mining this training period does not exist; instead, the models are continuously updated in real time. As data streams become faster and larger, the quantity of data to be processed can overwhelm a single machine's learning capabilities. One method to reduce the workload upon a data mining algorithm is to implement parallel solutions. This has the benefit of distributing the classification over one or more machines. Unfortunately, most parallel implementations of classification algorithms are not suitable for real-time processing, and most data stream mining algorithms are not suitable for parallelisation. This research develops real-time parallel classification of data instances for vast amounts of data. The proposed solution is highly scalable, as it incurs no additional communication costs when training. Moreover, it is capable of accepting data streams that contain multiple sources. The newly created algorithm, Parallel MC-NN, has been implemented and evaluated on open-source parallel technologies. The results of experimentation show a scalable solution that has been evaluated and peer reviewed via multiple publications.

21

Botella, Pérez Cristina. "Multivariate classification of gene expression microarray data." Doctoral thesis, Universitat Rovira i Virgili, 2010. http://hdl.handle.net/10803/9046.

Abstract:

Gene expression obtained from microarray analysis is used in many cases to classify cells. In this thesis, a probabilistic version of the Discriminant Partial Least Squares method (p-DPLS) is used to classify samples from the expression of their genes. p-DPLS is based on Bayes' rule for the posterior probability. Such classifiers are forced to always classify. To overcome this limitation, a reject option has been implemented. This option allows samples with a high risk of misclassification (that is, ambiguous samples and outliers) to be rejected. The reject option combines criteria based on the x-residuals, the leverage and the predicted values. In addition, a variable selection method is developed to choose the most relevant genes, since most of the genes analysed with a microarray are irrelevant for the particular classification purpose and can confuse the classifier. Finally, DPLS is extended to multi-class classification by combining PLS with linear discriminant analysis.

22

Hajimohammadi, Hamid Reza. "Classification of Data Series at Vehicle Detection." Thesis, Uppsala University, Department of Information Technology, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-111163.

Abstract:

This paper proposes a new, simple and lightweight approach, building on previously studied algorithms, for extracting feature vectors that in turn enable one to classify a vehicle based on the shape of its magnetic signature. The algorithm is called ASWA, which stands for Adaptive Spectral and Wavelet Analysis, and it combines features of a signal extracted by both spectral and wavelet analysis algorithms. The performance of classifiers using these feature vectors is compared to that of classifiers using feature vectors consisting of features extracted by the Fourier transform and pattern information of the signal extracted by the Hill-Pattern algorithm (CFTHP). Using ASWA-based feature vectors improved the results of all the classification algorithms, such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Probabilistic Neural Networks (PNN). The best improvement rate, however, was achieved using ASWA-based feature vectors with the KNN algorithm. The correct classification rate of the classifier using CFTHP-based feature vectors was 39.82 %, which improved to 69.93 % with ASWA. This corresponds to an overall relative improvement of 76 % in the correct classification rate.
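
The 76 % figure is a relative improvement, not a difference in percentage points. A short check of the arithmetic, using only the two rates quoted in the abstract (the function name is chosen here for illustration):

```python
def relative_improvement(before, after):
    """Relative change of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


cfthp_rate = 39.82  # correct classification rate with CFTHP features (%)
aswa_rate = 69.93   # correct classification rate with ASWA features (%)

gain_points = aswa_rate - cfthp_rate                       # absolute gain in percentage points
gain_relative = relative_improvement(cfthp_rate, aswa_rate)
print(f"absolute gain: {gain_points:.2f} points, relative gain: {gain_relative:.1f} %")
```

The absolute gain is about 30 percentage points; dividing it by the 39.82 % baseline gives roughly 75.6 %, which rounds to the 76 % reported.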

23

Selmer, Oyvind, and Mikael Brevik. "Classification and Visualisation of Twitter Sentiment Data." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-22967.

Abstract:

The social micro-blog site Twitter grows in user base each day and has become an attractive platform for companies, politicians, marketeers, and others wishing to share information and/or opinions. With a growing user market for Twitter, more and more systems and research are released that take advantage of its informal nature for opinion mining and sentiment analysis. This master's thesis describes a system for doing Sentiment Analysis on Twitter data and experiments with grid searches on various combinations of machine learning algorithms, features and preprocessing methods to achieve this. The classification system is fairly domain independent and performs better than baseline. The system is designed to be fast enough to classify large amounts of data and tweets in a stream, and provides an application program interface (API) to easily transfer data to applications or end users. Three visualisation applications are implemented, showing how to use the API and providing examples of how sentiment data can be used. The main contributions are: C1: A literature study of the state of the art for Twitter Sentiment Analysis. C2: The implementation of a general system architecture for doing Twitter Sentiment Analysis. C3: A comparison of different machine learning algorithms for the task of identifying sentiments in short messages in a fairly semi-independent domain. C4: Implementations of a set of visualisation applications, showing how to use data from the generic system and providing examples of how to present sentiment analysis data.

24

Gao, Ming. "A study on imbalanced data classification problems." Thesis, University of Reading, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.602707.

Abstract:

This thesis focuses on the study of machine learning and pattern recognition algorithms for imbalanced data problems. Imbalanced problems are important as they are prevalent in life-threatening and safety-critical applications. They are known to be problematic for standard machine learning algorithms due to the imbalanced distribution between positive and negative classes. My original contribution to knowledge in this field is fourfold. A powerful and efficient algorithm for solving two-class imbalanced problems is proposed. The proposed method combines the synthetic minority over-sampling technique and the radial basis function classifier optimised by particle swarm optimization to enhance the classifier's performance for imbalanced learning. An over-sampling technique for imbalanced problems, probability density function estimation based over-sampling, is proposed. In contrast to existing over-sampling techniques that lack sufficient theoretical insights and justifications, the synthetic data samples are generated from the probability density function estimated from the positive data via the Parzen window. A unified neurofuzzy modelling scheme is proposed. A novel initial rule construction method on the subspaces of the input features is formed. The supervised subspace orthogonal least squares learning for model construction is applied. A logistic regression model is formed to present the classifier's output. Based on the formation of the unified neurofuzzy model, a new class of neurofuzzy construction algorithms is proposed with the aim of maximizing generalization capability specifically for imbalanced data classification based on leave-one-out cross-validation.
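
Drawing synthetic samples from a Parzen-window density estimate can be sketched in a few lines. With a Gaussian kernel, sampling from the estimate amounts to picking one observed positive sample at random and perturbing it with Gaussian noise. The sketch below assumes an isotropic kernel with a hand-chosen bandwidth; the thesis may use a different kernel shape or bandwidth rule, and the function and parameter names are hypothetical.

```python
import random


def parzen_oversample(minority, n_new, bandwidth, seed=0):
    """Draw n_new synthetic points from a Gaussian Parzen-window estimate
    built on the minority-class samples (isotropic kernel, assumed here)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        centre = rng.choice(minority)  # pick one kernel uniformly at random
        # sample from the Gaussian kernel centred on that point
        synthetic.append([rng.gauss(x, bandwidth) for x in centre])
    return synthetic


minority = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.2]]
new_points = parzen_oversample(minority, n_new=5, bandwidth=0.1)
```

Unlike SMOTE, which interpolates between neighbouring minority samples, this draws from an explicit density estimate, which is the theoretical grounding the abstract refers to.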

25

Acosta, Mena Dionisio M. "Statistical classification of magnetic resonance imaging data." Thesis, University of Sussex, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.390913.

26

Berry, Ian Michael. "Data classification using unsupervised artificial neural networks." Thesis, University of Sussex, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.390079.

27

Hou, Jun. "Function Approximation and Classification with Perturbed Data." The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1618266875924225.

28

Al-Madi, Naila Shikri. "Improved Genetic Programming Techniques For Data Classification." Diss., North Dakota State University, 2014. https://hdl.handle.net/10365/27097.

Abstract:

Evolutionary algorithms are one category of optimization techniques that are inspired by processes of biological evolution. Evolutionary computation is applied to many domains and one of the most important is data mining. Data mining is a relatively broad field that deals with the automatic knowledge discovery from databases and it is one of the most developed fields in the area of artificial intelligence. Classification is a data mining method that assigns items in a collection to target classes with the goal of accurately predicting the target class for each item in the data. Genetic programming (GP) is one of the effective evolutionary computation techniques for solving classification problems. GP solves classification problems as an optimization task, where it searches for the solution with the highest accuracy. However, GP suffers from some weaknesses, such as long execution time and the need to tune many parameters for each problem. Furthermore, GP cannot obtain high accuracy for multiclass classification problems as opposed to binary problems. In this dissertation, we address these drawbacks and propose some approaches in order to overcome them. Adaptive GP variants are proposed in order to automatically adapt the parameter settings and shorten the execution time. Moreover, two approaches are proposed to improve the accuracy of GP when applied to multiclass classification problems. In addition, a Segment-based approach is proposed to accelerate the GP execution time for the data classification problem. Furthermore, a parallelization of the GP process using the MapReduce methodology is proposed, which aims to shorten the GP execution time and to provide the ability to use large population sizes, leading to faster convergence. The proposed approaches are evaluated using different measures, such as accuracy, execution time, sensitivity, specificity, and statistical tests. Comparisons of the proposed approaches with standard GP and with other classification techniques were performed, and the results showed that these approaches overcome the drawbacks of standard GP by successfully improving the accuracy and execution time.

29

Varnavas, Andreas Soteriou. "Signal processing methods for EEG data classification." Thesis, Imperial College London, 2008. http://hdl.handle.net/10044/1/11943.

30

Hyun, Jung Kim. "Classification in thoracic computed tomography image data." Diss., Restricted to subscribing institutions, 2007. http://proquest.umi.com/pqdweb?did=1383469071&sid=1&Fmt=2&clientId=1564&RQT=309&VName=PQD.

31

Lee, K. K. "Classification of imbalanced data with transparent kernels." Thesis, University of Southampton, 2002. https://eprints.soton.ac.uk/257937/.

32

Kazakeviciute, Agne. "Some theoretical essays on functional data classification." Thesis, University College London (University of London), 2017. http://discovery.ucl.ac.uk/1570359/.

Abstract:

Functional data analysis is a fast-growing research area in statistics, dealing with statistical analysis of infinite-dimensional (functional) data. For many pattern recognition problems with finite-dimensional data there usually exists a solid theoretical foundation; for example, it is known under which assumptions various classifiers have desirable theoretical properties, such as consistency. Therefore, a natural interest is to extend the theory to the setting of infinite-dimensional data. The thesis follows two directions: one is when we observe full curves, and the other is when we observe sparse and irregular curves. In the first direction, the main goal is to give a justification for a logistic classifier, where only the projection of the parameter function on some subspace is estimated via maximum quasi-likelihood and the rest of its coordinates are set to zero. This is preceded by a study of the problem of detecting sample point separation in logistic regression, that is, the case in which the maximum quasi-likelihood estimate of the model parameter does not exist or is not unique. In the other direction, a problem of extending sparsely and irregularly sampled functional data to full curves is considered so that potentially the theory from the first research direction could be applied in the future. There are several contributions of this thesis. First, it is proved that the separating hyperplane can be found from a finite set of candidates, and an upper bound of the probability of point separation is given. Second, the assumptions under which the logistic classifier is consistent are established, although simulation studies reveal that some assumptions are not necessary and may be relaxed. Third, the thesis proposes a collaborative curve extension method, which is proven to be consistent under certain assumptions.

33

DEMNI, Houyem. "Depth-based classification approaches for directional data." Doctoral thesis, Università degli studi di Cassino, 2021. http://hdl.handle.net/11580/83781.

Abstract:

Supervised learning tasks aim to define a data-based rule by which new objects are assigned to one of the given classes. To this end, a training set containing objects with known memberships is exploited. Directional data are points lying on the surface of circles, spheres or hyper-spheres. Given that they lie on a non-linear manifold, directional observations require specific methods to be analyzed. In this thesis, the main interest is to present novel methodologies and to perform reliable inferences for directional data, within the framework of supervised classification. First, a supervised classification procedure for directional data is introduced. The procedure is based on the cumulative distribution of the cosine depth, which is a directional distance-based depth function. The proposed method is compared with the max-depth classifier, a well-known depth-based classifier in the literature, through simulations and a real data example. Second, we study the optimality of the depth distribution and the max-depth classifiers from a theoretical perspective. More specifically, we investigate the necessary conditions under which the classifiers are optimal in the sense of the optimal Bayes rule. Then, we study the robustness of some directional depth-based classifiers in the presence of contaminated data. The performance of the depth distribution classifier, the max-depth classifier and the DD-classifier is evaluated by means of simulations in the presence of both class and attribute noise. Finally, the last part of the thesis is devoted to evaluating the performance of depth-based classifiers on a real directional data set.
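
The max-depth baseline mentioned above can be sketched as follows. The sketch assumes the usual form of a distance-based directional depth, namely the largest possible distance minus the mean distance to the sample, with cosine distance 1 - <x, y> on unit vectors (so the largest distance is 2). It illustrates the baseline only, not the depth-distribution classifier proposed in the thesis, and all names and data points are hypothetical.

```python
import math


def cosine_depth(x, sample):
    """Empirical cosine depth of unit vector x with respect to a sample of
    unit vectors: 2 minus the mean cosine distance (assumed definition)."""
    mean_dist = sum(1.0 - sum(a * b for a, b in zip(x, y)) for y in sample) / len(sample)
    return 2.0 - mean_dist


def max_depth_classify(x, classes):
    """Assign x to the class in which it is deepest."""
    return max(classes, key=lambda label: cosine_depth(x, classes[label]))


def on_circle(angle):
    """Point on the unit circle at the given angle in radians."""
    return (math.cos(angle), math.sin(angle))


# Two toy classes of directions, clustered around 0 and around pi/2.
training = {
    "A": [on_circle(a) for a in (-0.2, 0.0, 0.2)],
    "B": [on_circle(a) for a in (1.4, 1.6, 1.8)],
}
label_near_zero = max_depth_classify(on_circle(0.1), training)
label_near_north = max_depth_classify(on_circle(1.5), training)
```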

34

Palanisamy, Senthil Kumar. "Association rule based classification." Link to electronic thesis, 2006. http://www.wpi.edu/Pubs/ETD/Available/etd-050306-131517/.

Abstract:

Thesis (M.S.)--Worcester Polytechnic Institute.
Keywords: Itemset Pruning, Association Rules, Adaptive Minimal Support, Associative Classification, Classification. Includes bibliographical references (p.70-74).

35

Chan, Wing-yan Sarah, and 陳詠欣. "Emerging substrings for sequence classification." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2003. http://hub.hku.hk/bib/B2971672X.

36

Ramaboa, Kutlwano K. K. M. "A comparative evaluation of data mining classification techniques on medical trauma data." Master's thesis, University of Cape Town, 2004. http://hdl.handle.net/11427/5973.

Abstract:

Includes bibliographical references (leaves 109-113).
The purpose of this research was to determine the extent to which a selection of data mining classification techniques (specifically, Discriminant Analysis, Decision Trees, and three artificial neural network models: Backpropagation, Probabilistic Neural Networks, and the Radial Basis Function) are able to correctly classify cases into the different categories of an outcome measure from a given set of input variables (i.e. estimate their classification accuracy) on a common database.

37

Lundgren, Andreas. "Data-Driven Engine Fault Classification and Severity Estimation Using Residuals and Data." Thesis, Linköpings universitet, Fordonssystem, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-165736.

Abstract:

Recent technological advances in the automotive industry have made vehicular systems increasingly complex in terms of both hardware and software. As the complexity of the systems increases, so does the complexity of efficiently monitoring these systems. With increasing computational power, the field of diagnostics is becoming ever more focused on software solutions for detecting and classifying anomalies in the supervised systems. Model-based methods utilize knowledge about the physical system to devise nominal models of the system to detect deviations, while data-driven methods use historical data to come to conclusions about the present state of the system in question. This study proposes a combined model-based and data-driven diagnostic framework for fault classification, severity estimation and novelty detection. An algorithm is presented which uses a system model to generate a candidate set of residuals for the system. A subset of the residuals is then selected for each fault using L1-regularized logistic regression. The time series training data from the selected residuals is labelled with fault and severity. It is then compressed using a Gaussian parametric representation, and data from different fault modes are modelled using 1-class support vector machines. The classification of data is performed by utilizing the support vector machine description of the data in the residual space, and the fault severity is estimated as a convex optimization problem of minimizing the Kullback-Leibler divergence (kld) between the new data and training data of different fault modes and severities. The algorithm is tested with data collected from a commercial Volvo car engine in an engine test cell and the results are presented in this report. Initial tests indicate the potential of the kld for fault severity estimation and that novelty detection performance is closely tied to the residual selection process.
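
To make the role of the Kullback-Leibler divergence concrete, the sketch below uses the closed form for two univariate Gaussians and picks the training severity whose Gaussian summary is closest to the new data. This is a deliberate simplification: the thesis poses severity estimation as a convex optimization problem, whereas this example only does a discrete nearest-match lookup over a single residual. The severity values and Gaussian parameters are invented for illustration.

```python
import math


def kl_gaussian(m0, s0, m1, s1):
    """KL( N(m0, s0^2) || N(m1, s1^2) ) for univariate Gaussians."""
    return math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2 * s1 ** 2) - 0.5


def nearest_severity(new, reference):
    """Return the severity whose training Gaussian minimises the KL divergence
    from the Gaussian fitted to the new data (discrete stand-in for the
    convex problem described in the abstract)."""
    return min(reference, key=lambda sev: kl_gaussian(*new, *reference[sev]))


# Hypothetical (mean, std) summaries of one residual at three fault severities.
reference = {0.1: (0.5, 0.5), 0.2: (1.0, 0.5), 0.3: (1.5, 0.5)}
new_data = (1.1, 0.5)  # Gaussian summary of newly observed residual data
estimated = nearest_severity(new_data, reference)
```

Compressing each residual time series to a mean and standard deviation is what makes this comparison cheap, which matches the Gaussian parametric representation described in the abstract.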

38

Pruengkarn, Ratchakoon. "Enhancing classification performance by handling noise and imbalanced data with fuzzy classification techniques." Thesis, Pruengkarn, Ratchakoon (2018) Enhancing classification performance by handling noise and imbalanced data with fuzzy classification techniques. PhD thesis, Murdoch University, 2018. https://researchrepository.murdoch.edu.au/id/eprint/42505/.

Abstract:

This thesis studied the methodologies to improve the quality of training data in order to enhance classification performance. Noise and imbalance problems are two significant factors affecting data quality. Class noise is considered as the most harmful type of noise to a classifier’s performance, since incorrectly labelled examples may severely bias the learning method and result in inaccurate models. Removing mislabelled instances is more efficient than repairing and relabelling them. However, excessive removal of instances can be the cause of serious and irremediable loss of information. Under any circumstance, maintaining the noisy instances is worse than over eliminating. For these reasons, the conservation of instances without excessive filtering must be considered. Therefore, in the first part of this study, a noise removal technique using the Complementary technique with the Fuzzy Support Vector Machine (CMTFSVM) is proposed by considering misclassification analysis, in order to eliminate high potential uncertainty instances, which could lead to the mislabeling of samples in the training data. The results indicated that the CMTFSVM can reduce class noise and enhance the classification accuracy across different learning algorithms: Neural Network (NN), Support Vector Machine (SVM) and Fuzzy Support Vector Machine (FSVM). On the other hand, there exists the imbalance issue, which is the cause of poor performance for existing learning algorithms. In such situations, there are some classes which have their number of instances greater than the other classes. Traditional learning algorithms tend to be overwhelmed by the majority classes and ignore the minority classes. The minority classes are as important despite their rareness, as they can contain useful information, as well as being difficult to recognise because of their infrequency and casualness. 
The second part of this study is designed to overcome this bias by combining the CMTFSVM undersampling technique with the Synthetic Minority Over-sampling Technique (SMOTE) in a rebalancing technique called CMTSMT. CMTSMT has been proposed for handling the binary imbalance problem in order to filter the uncertainty instances out of the training datasets, as well as to promote the importance of the minority classes. The results revealed that CMTSMT can improve classification performance across various imbalance ratios by approximately 96% and 40% in terms of Geometric Mean (G-mean) and Area under the Receiver Operating Characteristic curve (AUC), respectively. Another type of imbalance problem is multiclass imbalance classification. Multiclass learning has been seen as a difficult task for classification algorithms, as multiclass classification may have a significantly lower performance than binary cases. Most existing techniques designed for a binary class imbalance problem cannot be applied directly to multiclass problems. In addition, between-class and within-class imbalance are the two main factors causing issues for learning algorithms. Decomposition techniques such as One-vs-One and One-vs-All are common techniques to deal with multiclass imbalance data. Their drawbacks, however, are a loss of balanced performance across all classes, high memory requirements and the need for more classifiers. Thus, a hybrid of Fuzzy C-Means clustering (FCM) and SMOTE called FCMSMT is proposed. The results showed that the FCMSMT technique could reduce between-class and within-class problems by balancing all the classes to have a similar number of class instances and randomly selecting instances (at least one) from each cluster. Moreover, the number of instances after applying the FCMSMT technique is similar to the number of instances in the original dataset, in order to prevent excessive undersampling and oversampling of class instances. The performance improvement of the FCMSMT technique over the original data on highly imbalanced data is approximately 10% and 5% in terms of G-mean and AUC, respectively. Thus, the FCMSMT technique could be an alternative way to deal with the multiclass imbalance classification problem.
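
For readers unfamiliar with the G-mean, the binary version is the geometric mean of sensitivity and specificity, so it is high only when both classes are recognised well. A minimal sketch, assuming both classes are present in the labels; the toy labels are made up and unrelated to the thesis results:

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.
    Assumes y_true contains at least one positive and one negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    return math.sqrt(sensitivity * specificity)


# One positive among ten samples; the classifier always predicts the majority class.
y_true = [1] + [0] * 9
always_majority = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
score = g_mean(y_true, always_majority)
```

The always-majority classifier reaches 90 % accuracy yet a G-mean of zero, which is why imbalance studies such as this one report G-mean and AUC instead of plain accuracy.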

39

Klose, Aljoscha Alexander. "Partially supervised learning of fuzzy classification rules." [S.l. : s.n.], 2004. http://deposit.ddb.de/cgi-bin/dokserv?idn=971682364.

40

Phillips, Rhonda D. "A Probabilistic Classification Algorithm With Soft Classification Output." Diss., Virginia Tech, 2009. http://hdl.handle.net/10919/26701.

Abstract:

This thesis presents a shared memory parallel version of the hybrid classification algorithm IGSCR (iterative guided spectral class rejection), a novel data reduction technique that can be used in conjunction with PIGSCR (parallel IGSCR), a noise removal method based on the maximum noise fraction (MNF), and a continuous version of IGSCR (CIGSCR) that outputs soft classifications. All of the above are either classification algorithms or preprocessing algorithms necessary prior to the classification of high dimensional, noisy images. PIGSCR was developed to produce fast and portable code using Fortran 95, OpenMP, and the Hierarchical Data Format version 5 (HDF5) and accompanying data access library. The feature reduction method introduced in this thesis is based on the singular value decomposition (SVD). This feature reduction technique demonstrated that SVD-based feature reduction can lead to more accurate IGSCR classifications than PCA-based feature reduction. This thesis describes a new algorithm used to adaptively filter a remote sensing dataset based on signal-to-noise ratios (SNRs) once the maximum noise fraction (MNF) has been applied. The adaptive filtering scheme improves image quality as shown by estimated SNRs and classification accuracy improvements greater than 10%. The continuous iterative guided spectral class rejection (CIGSCR) classification method is based on the iterative guided spectral class rejection (IGSCR) classification method for remotely sensed data. Both CIGSCR and IGSCR use semisupervised clustering to locate clusters that are associated with classes in a classification scheme. This type of semisupervised classification method is particularly useful in remote sensing where datasets are large, training data are difficult to acquire, and clustering makes the identification of subclasses adequate for training purposes less difficult. 
Experimental results indicate that the soft classification output by CIGSCR is reasonably accurate (when compared to IGSCR), and the fundamental algorithmic changes in CIGSCR (from IGSCR) result in CIGSCR being less sensitive to input parameters that influence iterations.
Ph. D.

41

Bressan, Marco José Miguel. "Statistical Independence for classification for High Dimensional Data." Doctoral thesis, Universitat Autònoma de Barcelona, 2003. http://hdl.handle.net/10803/3034.

42

Röder, Tido. "Similarity, retrieval, and classification of motion capture data." [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=983632332.

43

Cho, Hansang. "Classification of functional brain data for multimedia retrieval /." Thesis, Connect to this title online; UW restricted, 2005. http://hdl.handle.net/1773/5892.

44

Brandin, Martin, and Roger Hamrén. "Classification of Ground Objects Using Laser Radar Data." Thesis, Linköping University, Department of Electrical Engineering, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-1572.

Abstract:

Accurate 3D models of natural environments are important for many modelling and simulation applications, for both civilian and military purposes. When building 3D models from high resolution data acquired by an airborne laser scanner it is desirable to separate and classify the data to be able to process it further. For example, to build a polygon model of a building the samples belonging to the building must be found.

In this thesis we have developed, implemented (in IDL and ENVI), and evaluated algorithms for classification of buildings, vegetation, power lines, posts, and roads. The data is gridded and interpolated and a ground surface is estimated before the classification. For the building classification an object-based approach was used, unlike most classification algorithms, which are pixel based. The building classification has been tested and compared with two existing classification algorithms.

The developed algorithm classified 99.6 % of the building pixels correctly, while the two other algorithms classified 92.2 % and 80.5 % of the pixels correctly, respectively. The algorithms developed for the other classes were tested with the following results (correctly classified pixels): vegetation, 98.8 %; power lines, 98.2 %; posts, 42.3 %; roads, 96.2 %.

45

Zhao, Lei. "Learning from noisy data: Robust data classification." Thesis, 2012. http://researchonline.federation.edu.au/vital/access/HandleResolver/1959.17/65174.

Abstract:

The problem of learning from noisy data sets has been the focus of much attention for many years. Three different types of noise could be defined that generate difficulties in data classification. The first type is related to the noisy features and labels where data entry and data acquisition are inherently prone to errors. The second type is from the redundant features, which may confuse the classification algorithm and degrade the classification performance. The last type could be generated by insufficient features where some features may become quite ambiguous in the absence of related hidden complementary features. In order to address these problems, robust methods for data classification have been studied in many areas, such as bio-informatics, genetics, medicine, education and electronic engineering. This thesis aims to study classification methods that are robust for noisy data sets. Different problems caused by the three types of noise listed above are investigated. New robust methods for data classification are proposed. "From Abstract"
Doctor of Philosophy

46

Yu, Hsin-Min, and 余欣珉. "Applying Support Vector Data Description For Data Classification." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/61401033111710951818.

Abstract:

Master's
Chaoyang University of Technology
Master's Program, Department of Industrial Engineering and Management
101
Support Vector Data Description (SVDD) was developed by Tax and Duin in 1999. The objective of SVDD is to obtain a shaped decision boundary with minimum volume around a dataset. SVDD was first developed to detect outliers. In this study, SVDD is adopted as a classification tool. SVDD is not restricted by assumptions about the data. Moreover, the decision boundary is formed by Support Vectors (SVs), which are obtained by solving a convex quadratic programming problem. This study aims at evaluating the impact of preprocessing methods on the classification efficiency of SVDD. The evaluated preprocessing methods are widely used dimension reduction techniques, namely Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Three real cases are implemented. Among these, the gender prediction and mobile phone process cases are continuous-type datasets. The third case relates to nosocomial infection detection; it comes from Taichung Veterans General Hospital and is a discrete-type dataset. Kappa analysis demonstrated that SVDD without preprocessing methods achieves higher classification consistency and lower misclassification rates.

47

HO, MING-HSUAN, and 何明璇. "Classification of microarray data using fuzzy classification association rules." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/30548430326192789242.

Abstract:

Master's thesis
National Taiwan University of Science and Technology
Department of Information Management
98
With the advent of microarray technology, the expression of thousands of genes can now be measured simultaneously in one experiment. This powerful technology helps lay the foundation for bioinformatics and is widely used in disease diagnosis. In this thesis, we use fuzzy classification association rules to study the relationship between gene expressions and diseases in microarray data. In the proposed method, we first divide the universe of discourse of each gene expression into several intervals and define a membership function for each interval. Then, we fuzzify the original microarray data against the gene intervals. Finally, we use the Apriori algorithm to derive a set of classification association rules for each class of the microarray data. To classify a test sample, we calculate the membership degree of the sample against all the derived rules. The sample is assigned to the class for which it has the largest membership degree. Compared with existing classification methods for microarray data, which attain high prediction accuracy with little interpretability, the proposed method attains comparable prediction accuracy with significantly better interpretability. It can therefore help the study of cancers and improve the efficiency of disease diagnosis. We use three well-known microarray data sets to compare the performance of the proposed method with the decision tree induction method. The experimental results show that the proposed method significantly outperforms the decision tree induction method in prediction accuracy.
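
The classification step described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the triangular membership functions, the breakpoints in `FUZZY_SETS`, the gene names, and the two hand-written rules in `RULES` are all hypothetical, the Apriori mining step is skipped, and combining a rule's conditions with `min` and a class's rules with `max` is one common convention assumed here.

```python
# Hypothetical fuzzy sets over expression values normalized to [0, 1].
# Each entry is (left, peak, right) for a triangular membership function.
FUZZY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}

# Hypothetical rules: (conditions, class label). In the abstract these
# would be mined with Apriori; here they are written by hand.
RULES = [
    ((("geneA", "high"), ("geneB", "low")), "tumor"),
    ((("geneA", "low"),), "normal"),
]


def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify(sample):
    """Map each gene value to its membership degree in every fuzzy set."""
    return {
        gene: {
            label: triangular(value, *abc)
            for label, abc in FUZZY_SETS.items()
        }
        for gene, value in sample.items()
    }


def rule_degree(conditions, memberships):
    """Degree to which a sample matches a rule (assumed: min over conditions)."""
    return min(memberships[gene][label] for gene, label in conditions)


def classify(sample):
    """Return the class with the largest membership degree, plus all scores."""
    memberships = fuzzify(sample)
    scores = {}
    for conditions, label in RULES:
        degree = rule_degree(conditions, memberships)
        # Assumed aggregation: a class scores as high as its best rule.
        scores[label] = max(scores.get(label, 0.0), degree)
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    print(classify({"geneA": 0.9, "geneB": 0.1}))
    print(classify({"geneA": 0.1, "geneB": 0.5}))
```

Because each rule is a readable conjunction of fuzzy conditions, the matched rule doubles as an explanation of the prediction, which is the interpretability advantage the abstract claims over less transparent classifiers.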

48

"Data Compression by Unsupervised Classification." Department of Statistics and Mathematics, 1997. http://epub.wu-wien.ac.at/dyn/dl/wp/epub-wu-01_a2f.

49

Zhang, Xin. "Classification in the missing data." Master's thesis, 2010. http://hdl.handle.net/10048/1290.

Abstract:

Missing data is always a problem in data analysis. This is especially the case in anthropology, where sex determination is one of the primary goals for fossil skull data and many measurements are unavailable. The objectives of this thesis are to find a classifier that can handle a large amount of missingness and to improve prediction (classification) ability. Besides the crude method (ignoring cases with missing values), three techniques for handling missing values are discussed: bootstrap imputation, a weighted-averaging classifier, and classification trees. All three make use of every case in the data and can handle any case with missing values. Diabetes data and fossil skull data are used to compare the methods by misclassification error rate. Each method has its own advantages and performs better under certain conditions.
Statistics

50

Yi, Jiang Jhih, and 姜芝怡. "An Incremental Data Classification Technique." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/05970856233052098455.

Abstract:

Master's thesis
National Tsing Hua University
Institute of Information Systems and Applications
92
In this highly competitive age, a company must continuously watch for the latest information to keep the upper hand in its industry. It may find this information in the mass media or on the market, or even in its own database. Mining previously unseen information and turning it into competitive strategy is an essential task in data mining. Customer relationship management (CRM) systems are among the most popular data mining applications. In this study, we analyze a subsystem of a 3C retailer's CRM system: an eCard recommendation system. We also propose an architecture for incremental data classification and apply it to the eCard recommendation system to see whether it performs better than the existing classifiers. Experimental results show that the classifier built with the proposed method has an acceptable error rate compared with the existing classifiers. Moreover, it can generate a set of rules that provide a high-level semantic description of the data.

