Dissertations / Theses: 'Random Forests Classifiers'

1

Siegel, Kathryn I. (Kathryn Iris). "Incremental random forest classifiers in spark." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106105.

Full text

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
Cataloged from PDF version of thesis.
Includes bibliographical references (page 53).
The random forest is a machine learning algorithm that has gained popularity due to its resistance to noise, good performance, and training efficiency. Random forests are typically constructed using a static dataset; to accommodate new data, random forests are usually regrown. This thesis presents two main strategies for updating random forests incrementally, rather than entirely rebuilding the forests. I implement these two strategies-incrementally growing existing trees and replacing old trees-in Spark Machine Learning(ML), a commonly used library for running ML algorithms in Spark. My implementation draws from existing methods in online learning literature, but includes several novel refinements. I evaluate the two implementations, as well as a variety of hybrid strategies, by recording their error rates and training times on four different datasets. My benchmarks show that the optimal strategy for incremental growth depends on the batch size and the presence of concept drift in a data workload. I find that workloads with large batches should be classified using a strategy that favors tree regrowth, while workloads with small batches should be classified using a strategy that favors incremental growth of existing trees. Overall, the system demonstrates significant efficiency gains when compared to the standard method of regrowing the random forest.
by Kathryn I. Siegel.
M. Eng.

APA, Harvard, Vancouver, ISO, and other styles

2

Nygren, Rasmus. "Evaluation of hyperparameter optimization methods for Random Forest classifiers." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301739.

Full text

Abstract:

In order to create a machine learning model, one is often tasked with selecting certain hyperparameters which configure the behavior of the model. The performance of the model can vary greatly depending on how these hyperparameters are selected, thus making it relevant to investigate the effects of hyperparameter optimization on the classification accuracy of a machine learning model. In this study, we train and evaluate a Random Forest classifier whose hyperparameters are set to default values and compare its classification accuracy to another classifier whose hyperparameters are obtained through the use of the hyperparameter optimization (HPO) methods Random Search, Bayesian Optimization and Particle Swarm Optimization. This is done on three different datasets, and each HPO method is evaluated based on the classification accuracy change it induces across the datasets. We found that every HPO method yielded a total classification accuracy increase of approximately 2-3% across all datasets compared to the accuracies obtained using the default hyperparameters. However, due to limitations of time, data and computational resources, no assertions can be made as to whether the observed positive effect is generalizable at a larger scale. Instead, we could conclude that the utility of HPO methods is dependent on the dataset at hand.
För att skapa en maskininlärningsmodell behöver en ofta välja olika hyperparametrar som konfigurerar modellens egenskaper. Prestandan av en sådan modell beror starkt på valet av dessa hyperparametrar, varför det är relevant att undersöka hur optimering av hyperparametrar kan påverka klassifikationssäkerheten av en maskininlärningsmodell. I denna studie tränar och utvärderar vi en Random Forest-klassificerare vars hyperparametrar sätts till särskilda standardvärden och jämför denna med en klassificerare vars hyperparametrar bestäms av tre olika metoder för optimering av hyperparametrar (HPO) - Random Search, Bayesian Optimization och Particle Swarm Optimization. Detta görs på tre olika dataset, och varje HPO- metod utvärderas baserat på den ändring av klassificeringsträffsäkerhet som den medför över dessa dataset. Vi fann att varje HPO-metod resulterade i en total ökning av klassificeringsträffsäkerhet på cirka 2-3% över alla dataset jämfört med den träffsäkerhet som kruleslassificeraren fick med standardvärdena för hyperparametrana. På grund av begränsningar i form av tid och data kunde vi inte fastställa om den positiva effekten är generaliserbar till en större skala. Slutsatsen som kunde dras var istället att användbarheten av metoder för optimering av hyperparametrar är beroende på det dataset de tillämpas på.

APA, Harvard, Vancouver, ISO, and other styles

3

Sandsveden, Daniel. "Evaluation of Random Forests for Detection and Localization of Cattle Eyes." Thesis, Linköpings universitet, Datorseende, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-121540.

Full text

Abstract:

In a time when cattle herds grow continually larger the need for automatic methods to detect diseases is ever increasing. One possible method to discover diseases is to use thermal images and automatic head and eye detectors. In this thesis an eye detector and a head detector is implemented using the Random Forests classifier. During the implementation the classifier is evaluated using three different descriptors: Histogram of Oriented Gradients, Local Binary Patterns, and a descriptor based on pixel differences. An alternative classifier, the Support Vector Machine, is also evaluated for comparison against Random Forests. The thesis results show that Histogram of Oriented Gradients performs well as a description of cattle heads, while Local Binary Patterns performs well as a description of cattle eyes. The provided descriptor performs almost equally well in both cases. The results also show that Random Forests performs approximately as good as the Support Vector Machine, when the Support Vector Machine is paired with Local Binary Patterns for both heads and eyes. Finally the thesis results indicate that it is easier to detect and locate cattle heads than it is to detect and locate cattle eyes. For eyes, combining a head detector and an eye detector is shown to give a better result than only using an eye detector. In this combination heads are first detected in images, followed by using the eye detector in areas classified as heads.

APA, Harvard, Vancouver, ISO, and other styles

4

Abd, El Meguid Mostafa. "Unconstrained facial expression recognition in still images and video sequences using Random Forest classifiers." Thesis, McGill University, 2012. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=107692.

Full text

Abstract:

The aim of this project is to construct and implement a comprehensive facial expression detection and classification framework through the use of a proprietary face detector (PittPatt) and a novel classifier consisting of a set of Random Forests paired with either support vector machine or k-nearest neighbour labellers. The system should perform at real-time rates under unconstrained image conditions, with no intermediate human intervention. The still-image Binghamton University 3D Facial Expression database was used for training purposes, while a number of other expression-labelled video databases were used for testing. Quantitative evidence for qualitative and intuitive facial expression recognition constitutes the main theoretical contribution to the field.
L'objectif de ce projet est de construire et mettre en œuvre un cadre complète de détection de l'expression du visage par l'utilisation d'un détecteur de visage exclusif (PittPatt) et un nouveau classificateur composé d'un ensemble de 'Random Forests' a accompagné d'un étiqueteur 'support vector machine' ou 'k-nearest neighbour'. Le système doit effectuer au temps réel, dans des conditions sans contrainte, sans aucune intervention humaine intermédiaires. La base de données d'images fixes 'Binghamton University 3D Facial Expressions' était utilisé à des fins de formation. Un nombre de bases de données d'expression d'images fixes et de vidéo ont été utilisés pour l'évaluation. Des données quantitatives pour l'analyse qualitative, et parfois intuitive, les sujets liés à l'expression faciale constituaient la contribution principale et théorique sur le terrain.

APA, Harvard, Vancouver, ISO, and other styles

5

Sjöqvist, Hugo. "Classifying Forest Cover type with cartographic variables via the Support Vector Machine, Naive Bayes and Random Forest classifiers." Thesis, Örebro universitet, Handelshögskolan vid Örebro Universitet, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-58384.

Full text

APA, Harvard, Vancouver, ISO, and other styles

6

Halmann, Marju. "Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14710.

Full text

Abstract:

Filtering out and replying automatically to emails are of interest to many but is hard due to the complexity of the language and to dependencies of background information that is not present in the email itself. This paper investigates whether Latent Dirichlet Allocation (LDA) combined with Random Forest classifier can be used for the more general email classification task and how it compares to other existing email classifiers. The comparison is based on the literature study and on the empirical experimentation using two real-life datasets. Firstly, a literature study is performed to gain insight of the accuracy of other available email classifiers. Secondly, proposed model’s accuracy is explored with experimentation. The literature study shows that the accuracy of more general email classifiers differs greatly on different user sets. The proposed model accuracy is within the reported accuracy range, however in the lower part. It indicates that the proposed model performs poorly compared to other classifiers. On average, the classifier performance improves 15 percentage points with additional information. This indicates that Latent Dirichlet Allocation (LDA) combined with Random Forest classifier is promising, however future studies are needed to explore the model and ways to further increase the accuracy.

APA, Harvard, Vancouver, ISO, and other styles

7

Zhang, Qing Frankowski Ralph. "An empirical evaluation of the random forests classifier models for variable selection in a large-scale lung cancer case-control study /." See options below, 2006. http://proquest.umi.com/pqdweb?did=1324365481&sid=1&Fmt=2&clientId=68716&RQT=309&VName=PQD.

Full text

APA, Harvard, Vancouver, ISO, and other styles

8

Xia, Junshi. "Multiple classifier systems for the classification of hyperspectral data." Thesis, Grenoble, 2014. http://www.theses.fr/2014GRENT047/document.

Full text

Abstract:

Dans cette thèse, nous proposons plusieurs nouvelles techniques pour la classification d'images hyperspectrales basées sur l'apprentissage d'ensemble. Le cadre proposé introduit des innovations importantes par rapport aux approches précédentes dans le même domaine, dont beaucoup sont basées principalement sur un algorithme individuel. Tout d'abord, nous proposons d'utiliser la Forêt de Rotation (Rotation Forest) avec différentes techiniques d'extraction de caractéristiques linéaire et nous comparons nos méthodes avec les approches d'ensemble traditionnelles, tels que Bagging, Boosting, Sous-espace Aléatoire et Forêts Aléatoires. Ensuite, l'intégration des machines à vecteurs de support (SVM) avec le cadre de sous-espace de rotation pour la classification de contexte est étudiée. SVM et sous-espace de rotation sont deux outils puissants pour la classification des données de grande dimension. C'est pourquoi, la combinaison de ces deux méthodes peut améliorer les performances de classification. Puis, nous étendons le travail de la Forêt de Rotation en intégrant la technique d'extraction de caractéristiques locales et l'information contextuelle spatiale avec un champ de Markov aléatoire (MRF) pour concevoir des méthodes spatio-spectrale robustes. Enfin, nous présentons un nouveau cadre général, ensemble de sous-espace aléatoire, pour former une série de classifieurs efficaces, y compris les arbres de décision et la machine d'apprentissage extrême (ELM), avec des profils multi-attributs étendus (EMaPS) pour la classification des données hyperspectrales. Six méthodes d'ensemble de sous-espace aléatoire, y compris les sous-espaces aléatoires avec les arbres de décision, Forêts Aléatoires (RF), la Forêt de Rotation (RoF), la Forêt de Rotation Aléatoires (Rorf), RS avec ELM (RSELM) et sous-espace de rotation avec ELM (RoELM), sont construits par multiples apprenants de base. L'efficacité des techniques proposées est illustrée par la comparaison avec des méthodes de l'état de l'art en utilisant des données hyperspectrales réelles dans de contextes différents
In this thesis, we propose several new techniques for the classification of hyperspectral remote sensing images based on multiple classifier system (MCS). Our proposed framework introduces significant innovations with regards to previous approaches in the same field, many of which are mainly based on an individual algorithm. First, we propose to use Rotation Forests with several linear feature extraction and compared them with the traditional ensemble approaches, such as Bagging, Boosting, Random subspace and Random Forest. Second, the integration of the support vector machines (SVM) with Rotation subspace framework for context classification is investigated. SVM and Rotation subspace are two powerful tools for high-dimensional data classification. Therefore, combining them can further improve the classification performance. Third, we extend the work of Rotation Forests by incorporating local feature extraction technique and spatial contextual information with Markov random Field (MRF) to design robust spatial-spectral methods. Finally, we presented a new general framework, Random subspace ensemble, to train series of effective classifiers, including decision trees and extreme learning machine (ELM), with extended multi-attribute profiles (EMAPs) for classifying hyperspectral data. Six RS ensemble methods, including Random subspace with DT (RSDT), Random Forest (RF), Rotation Forest (RoF), Rotation Random Forest (RoRF), RS with ELM (RSELM) and Rotation subspace with ELM (RoELM), are constructed by the multiple base learners. The effectiveness of the proposed techniques is illustrated by comparing with state-of-the-art methods by using real hyperspectral data sets with different contexts

APA, Harvard, Vancouver, ISO, and other styles

9

Pettersson, Anders. "High-Dimensional Classification Models with Applications to Email Targeting." Thesis, KTH, Matematisk statistik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-168203.

Full text

Abstract:

Email communication is valuable for any modern company, since it offers an easy mean for spreading important information or advertising new products, features or offers and much more. To be able to identify which customers that would be interested in certain information would make it possible to significantly improve a company's email communication and as such avoiding that customers start ignoring messages and creating unnecessary badwill. This thesis focuses on trying to target customers by applying statistical learning methods to historical data provided by the music streaming company Spotify. An important aspect was the high-dimensionality of the data, creating certain demands on the applied methods. A binary classification model was created, where the target was whether a customer will open the email or not. Two approaches were used for trying to target the costumers, logistic regression, both with and without regularization, and random forest classifier, for their ability to handle the high-dimensionality of the data. Performance accuracy of the suggested models were then evaluated on both a training set and a test set using statistical validation methods, such as cross-validation, ROC curves and lift charts. The models were studied under both large-sample and high-dimensional scenarios. The high-dimensional scenario represents when the number of observations, N, is of the same order as the number of features, p and the large sample scenario represents when N ≫ p. Lasso-based variable selection was performed for both these scenarios, to study the informative value of the features. This study demonstrates that it is possible to greatly improve the opening rate of emails by targeting users, even in the high dimensional scenario. The results show that increasing the amount of training data over a thousand fold will only improve the performance marginally. Rather efficient customer targeting can be achieved by using a few highly informative variables selected by the Lasso regularization.
Företag kan använda e-mejl för att på ett enkelt sätt sprida viktig information, göra reklam för nya produkter eller erbjudanden och mycket mer, men för många e-mejl kan göra att kunder slutar intressera sig för innehållet, genererar badwill och omöjliggöra framtida kommunikation. Att kunna urskilja vilka kunder som är intresserade av det specifika innehållet skulle vara en möjlighet att signifikant förbättra ett företags användning av e-mejl som kommunikationskanal. Denna studie fokuserar på att urskilja kunder med hjälp av statistisk inlärning applicerad på historisk data tillhandahållen av musikstreaming-företaget Spotify. En binärklassificeringsmodell valdes, där responsvariabeln beskrev huruvida kunden öppnade e-mejlet eller inte. Två olika metoder användes för att försöka identifiera de kunder som troligtvis skulle öppna e-mejlen, logistisk regression, både med och utan regularisering, samt random forest klassificerare, tack vare deras förmåga att hantera högdimensionella data. Metoderna blev sedan utvärderade på både ett träningsset och ett testset, med hjälp av flera olika statistiska valideringsmetoder så som korsvalidering och ROC kurvor. Modellerna studerades under både scenarios med stora stickprov och högdimensionella data. Där scenarion med högdimensionella data representeras av att antalet observationer, N, är av liknande storlek som antalet förklarande variabler, p, och scenarion med stora stickprov representeras av att N ≫ p. Lasso-baserad variabelselektion utfördes för båda dessa scenarion för att studera informationsvärdet av förklaringsvariablerna. Denna studie visar att det är möjligt att signifikant förbättra öppningsfrekvensen av e-mejl genom att selektera kunder, även när man endast använder små mängder av data. Resultaten visar att en enorm ökning i antalet träningsobservationer endast kommer förbättra modellernas förmåga att urskilja kunder marginellt.

APA, Harvard, Vancouver, ISO, and other styles

10

Amlathe, Prakhar. "Standard Machine Learning Techniques in Audio Beehive Monitoring: Classification of Audio Samples with Logistic Regression, K-Nearest Neighbor, Random Forest and Support Vector Machine." DigitalCommons@USU, 2018. https://digitalcommons.usu.edu/etd/7050.

Full text

Abstract:

Honeybees are one of the most important pollinating species in agriculture. Every three out of four crops have honeybee as their sole pollinator. Since 2006 there has been a drastic decrease in the bee population which is attributed to Colony Collapse Disorder(CCD). The bee colonies fail/ die without giving any traditional health symptoms which otherwise could help in alerting the Beekeepers in advance about their situation. Electronic Beehive Monitoring System has various sensors embedded in it to extract video, audio and temperature data that could provide critical information on colony behavior and health without invasive beehive inspections. Previously, significant patterns and information have been extracted by processing the video/image data, but no work has been done using audio data. This research inaugurates and takes the first step towards the use of audio data in the Electronic Beehive Monitoring System (BeePi) by enabling a path towards the automatic classification of audio samples in different classes and categories within it. The experimental results give an initial support to the claim that monitoring of bee buzzing signals from the hive is feasible, it can be a good indicator to estimate hive health and can help to differentiate normal behavior against any deviation for honeybees.

APA, Harvard, Vancouver, ISO, and other styles

11

Olofsson, Nina. "A Machine Learning Ensemble Approach to Churn Prediction : Developing and Comparing Local Explanation Models on Top of a Black-Box Classifier." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-210565.

Full text

Abstract:

Churn prediction methods are widely used in Customer Relationship Management and have proven to be valuable for retaining customers. To obtain a high predictive performance, recent studies rely on increasingly complex machine learning methods, such as ensemble or hybrid models. However, the more complex a model is, the more difficult it becomes to understand how decisions are actually made. Previous studies on machine learning interpretability have used a global perspective for understanding black-box models. This study explores the use of local explanation models for explaining the individual predictions of a Random Forest ensemble model. The churn prediction was studied on the users of Tink – a finance app. This thesis aims to take local explanations one step further by making comparisons between churn indicators of different user groups. Three sets of groups were created based on differences in three user features. The importance scores of all globally found churn indicators were then computed for each group with the help of local explanation models. The results showed that the groups did not have any significant differences regarding the globally most important churn indicators. Instead, differences were found for globally less important churn indicators, concerning the type of information that users stored in the app. In addition to comparing churn indicators between user groups, the result of this study was a well-performing Random Forest ensemble model with the ability of explaining the reason behind churn predictions for individual users. The model proved to be significantly better than a number of simpler models, with an average AUC of 0.93.
Metoder för att prediktera utträde är vanliga inom Customer Relationship Management och har visat sig vara värdefulla när det kommer till att behålla kunder. För att kunna prediktera utträde med så hög säkerhet som möjligt har den senasteforskningen fokuserat på alltmer komplexa maskininlärningsmodeller, såsom ensembler och hybridmodeller. En konsekvens av att ha alltmer komplexa modellerär dock att det blir svårare och svårare att förstå hur en viss modell har kommitfram till ett visst beslut. Tidigare studier inom maskininlärningsinterpretering har haft ett globalt perspektiv för att förklara svårförståeliga modeller. Denna studieutforskar lokala förklaringsmodeller för att förklara individuella beslut av en ensemblemodell känd som 'Random Forest'. Prediktionen av utträde studeras påanvändarna av Tink – en finansapp. Syftet med denna studie är att ta lokala förklaringsmodeller ett steg längre genomatt göra jämförelser av indikatorer för utträde mellan olika användargrupper. Totalt undersöktes tre par av grupper som påvisade skillnader i tre olika variabler. Sedan användes lokala förklaringsmodeller till att beräkna hur viktiga alla globaltfunna indikatorer för utträde var för respektive grupp. Resultaten visade att detinte fanns några signifikanta skillnader mellan grupperna gällande huvudindikatorerna för utträde. Istället visade resultaten skillnader i mindre viktiga indikatorer som hade att göra med den typ av information som lagras av användarna i appen. Förutom att undersöka skillnader i indikatorer för utträde resulterade dennastudie i en välfungerande modell för att prediktera utträde med förmågan attförklara individuella beslut. Random Forest-modellen visade sig vara signifikantbättre än ett antal enklare modeller, med ett AUC-värde på 0.93.

APA, Harvard, Vancouver, ISO, and other styles

12

Konečný, Antonín. "Využití umělé inteligence v technické diagnostice." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2021. http://www.nusl.cz/ntk/nusl-443221.

Full text

Abstract:

The diploma thesis is focused on the use of artificial intelligence methods for evaluating the fault condition of machinery. The evaluated data are from a vibrodiagnostic model for simulation of static and dynamic unbalances. The machine learning methods are applied, specifically supervised learning. The thesis describes the Spyder software environment, its alternatives, and the Python programming language, in which the scripts are written. It contains an overview with a description of the libraries (Scikit-learn, SciPy, Pandas ...) and methods — K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Trees (DT) and Random Forests Classifiers (RF). The results of the classification are visualized in the confusion matrix for each method. The appendix includes written scripts for feature engineering, hyperparameter tuning, evaluation of learning success and classification with visualization of the result.

APA, Harvard, Vancouver, ISO, and other styles

13

Zoghi, Zeinab. "Ensemble Classifier Design and Performance Evaluation for Intrusion Detection Using UNSW-NB15 Dataset." University of Toledo / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1596756673292254.

Full text

APA, Harvard, Vancouver, ISO, and other styles

14

Heidfors, Filip, and Elias Moltedo. "Maskininlärning: avvikelseklassificering på sekventiell sensordata. En jämförelse och utvärdering av algoritmer för att klassificera avvikelser i en miljövänlig IoT produkt med sekventiell sensordata." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20742.

Full text

Abstract:

Ett företag har tagit fram en miljövänlig IoT produkt med sekventiell sensordata och vill genom maskininlärning kunna klassificera avvikelser i sensordatan. Det har genom åren utvecklats ett flertal väl fungerande algoritmer för klassificering men det finns emellertid ingen algoritm som fungerar bäst för alla olika problem. Syftet med det här arbetet var därför att undersöka, jämföra och utvärdera olika klassificerare inom "supervised machine learning" för att ta reda på vilken klassificerare som ger högst träffsäkerhet att klassificera avvikelser i den typ av IoT produkt som företaget tagit fram. Genom en litteraturstudie tog vi först reda på vilka klassificerare som vanligtvis använts och fungerat bra i tidigare vetenskapliga arbeten med liknande applikationer. Vi kom fram till att jämföra och utvärdera Random Forest, Naïve Bayes klassificerare och Support Vector Machines ytterligare. Vi skapade sedan ett dataset på 513 exempel som vi använde för träning och validering för respektive klassificerare. Resultatet visade att Random Forest hade betydligt högre träffsäkerhet med 95,7% jämfört med Naïve Bayes klassificerare (81,5%) och Support Vector Machines (78,6%). Slutsatsen för arbetet är att Random Forest med sina 95,7% ger en tillräckligt hög träffsäkerhet så att företaget kan använda maskininlärningsmodellen för att förbättra sin produkt. Resultatet pekar också på att Random Forest, för det här arbetets specifika klassificeringsproblem, är den klassificerare som fungerar bäst inom "supervised machine learning" men att det eventuellt finns möjlighet att få ännu högre träffsäkerhet med andra tekniker som till exempel "unsupervised machine learning" eller "semi-supervised machine learning".
A company has developed a environment-friendly IoT device with sequential sensor data and want to use machine learning to classify anomalies in their data. Throughout the years, several well working algorithms for classifications have been developed. However, there is no optimal algorithm for every problem. The purpose of this work was therefore to investigate, compare and evaluate different classifiers within supervised machine learning to find out which classifier that gives the best accuracy to classify anomalies in the kind of IoT device that the company has developed. With a literature review we first wanted to find out which classifiers that are commonly used and have worked well in related work for similar purposes and applications. We concluded to further compare and evaluate Random Forest, Naïve Bayes and Support Vector Machines. We created a dataset of 513 examples that we used for training and evaluation for each classifier. The result showed that Random Forest had superior accuracy with 95.7% compared to Naïve Bayes (81.5%) and Support Vector Machines (78.6%). The conclusion for this work is that Random Forest, with 95.7%, gives a high enough accuracy for the company to have good use of the machine learning model. The result also indicates that Random Forest, for this thesis specific classification problem, is the best classifier within supervised machine learning but that there is a potential possibility to get even higher accuracy with other techniques such as unsupervised machine learning or semi-supervised machine learning.

APA, Harvard, Vancouver, ISO, and other styles

15

Cleve, Oscar, and Sara Gustafsson. "Automatic Feature Extraction for Human Activity Recognitionon the Edge." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-260247.

Full text

Abstract:

This thesis evaluates two methods for automatic feature extraction to classify the accelerometer data of periodic and sporadic human activities. The first method selects features using individual hypothesis tests and the second one is using a random forest classifier as an embedded feature selector. The hypothesis test was combined with a correlation filter in this study. Both methods used the same initial pool of automatically generated time series features. A decision tree classifier was used to perform the human activity recognition task for both methods.The possibility of running the developed model on a processor with limited computing power was taken into consideration when selecting methods for evaluation. The classification results showed that the random forest method was good at prioritizing among features. With 23 features selected it had a macro average F1 score of 0.84 and a weighted average F1 score of 0.93. The first method, however, only had a macro average F1 score of 0.40 and a weighted average F1 score of 0.63 when using the same number of features. In addition to the classification performance this thesis studies the potential business benefits that automation of feature extractioncan result in.
Denna studie utvärderar två metoder som automatiskt extraherar features för att klassificera accelerometerdata från periodiska och sporadiska mänskliga aktiviteter. Den första metoden väljer features genom att använda individuella hypotestester och den andra metoden använder en random forest-klassificerare som en inbäddad feature-väljare. Hypotestestmetoden kombinerades med ett korrelationsfilter i denna studie. Båda metoderna använde samma initiala samling av automatiskt genererade features. En decision tree-klassificerare användes för att utföra klassificeringen av de mänskliga aktiviteterna för båda metoderna. Möjligheten att använda den slutliga modellen på en processor med begränsad hårdvarukapacitet togs i beaktning då studiens metoder valdes. Klassificeringsresultaten visade att random forest-metoden hade god förmåga att prioritera bland features. Med 23 utvalda features erhölls ett makromedelvärde av F1 score på 0,84 och ett viktat medelvärde av F1 score på 0,93. Hypotestestmetoden resulterade i ett makromedelvärde av F1 score på 0,40 och ett viktat medelvärde av F1 score på 0,63 då lika många features valdes ut. Utöver resultat kopplade till klassificeringsproblemet undersöker denna studie även potentiella affärsmässiga fördelar kopplade till automatisk extrahering av features.

APA, Harvard, Vancouver, ISO, and other styles

16

Khan, Syeduzzaman. "A PROBABILISTIC MACHINE LEARNING FRAMEWORK FOR CLOUD RESOURCE SELECTION ON THE CLOUD." Scholarly Commons, 2020. https://scholarlycommons.pacific.edu/uop_etds/3720.

Full text

Abstract:

The execution of the scientific applications on the Cloud comes with great flexibility, scalability, cost-effectiveness, and substantial computing power. Market-leading Cloud service providers such as Amazon Web service (AWS), Azure, Google Cloud Platform (GCP) offer various general purposes, memory-intensive, and compute-intensive Cloud instances for the execution of scientific applications. The scientific community, especially small research institutions and undergraduate universities, face many hurdles while conducting high-performance computing research in the absence of large dedicated clusters. The Cloud provides a lucrative alternative to dedicated clusters, however a wide range of Cloud computing choices makes the instance selection for the end-users. This thesis aims to simplify Cloud instance selection for end-users by proposing a probabilistic machine learning framework to allow to users select a suitable Cloud instance for their scientific applications. This research builds on the previously proposed A2Cloud-RF framework that recommends high-performing Cloud instances by profiling the application and the selected Cloud instances. The framework produces a set of objective scores called the A2Cloud scores, which denote the compatibility level between the application and the selected Cloud instances. When used alone, the A2Cloud scores become increasingly unwieldy with an increasing number of tested Cloud instances. Additionally, the framework only examines the raw application performance and does not consider the execution cost to guide resource selection. To improve the usability of the framework and assist with economical instance selection, this research adds two Naïve Bayes (NB) classifiers that consider both the application’s performance and execution cost. These NB classifiers include: 1) NB with a Random Forest Classifier (RFC) and 2) a standalone NB module. Naïve Bayes with a Random Forest Classifier (RFC) augments the A2Cloud-RF framework's final instance ratings with the execution cost metric. In the training phase, the classifier builds the frequency and probability tables. The classifier recommends a Cloud instance based on the highest posterior probability for the selected application. The standalone NB classifier uses the generated A2Cloud score (an intermediate result from the A2Cloud-RF framework) and execution cost metric to construct an NB classifier. The NB classifier forms a frequency table and probability (prior and likelihood) tables. For recommending a Cloud instance for a test application, the classifier calculates the highest posterior probability for all of the Cloud instances. The classifier recommends a Cloud instance with the highest posterior probability. This study performs the execution of eight real-world applications on 20 Cloud instances from AWS, Azure, GCP, and Linode. We train the NB classifiers using 80% of this dataset and employ the remaining 20% for testing. The testing yields more than 90% recommendation accuracy for the chosen applications and Cloud instances. Because of the imbalanced nature of the dataset and multi-class nature of classification, we consider the confusion matrix (true positive, false positive, true negative, and false negative) and F1 score with above 0.9 scores to describe the model performance. The final goal of this research is to make Cloud computing an accessible resource for conducting high-performance scientific executions by enabling users to select an effective Cloud instance from across multiple providers.

APA, Harvard, Vancouver, ISO, and other styles

17

Drábek, Matěj. "Využití vybraných metod strojového učení pro modelování kreditního rizika." Master's thesis, Vysoká škola ekonomická v Praze, 2017. http://www.nusl.cz/ntk/nusl-360509.

Full text

Abstract:

This master's thesis is divided into three parts. In the first part I described P2P lending, its characteristics, basic concepts and practical implications. I also compared P2P market in the Czech Republic, UK and USA. The second part consists of theoretical basics for chosen methods of machine learning, which are naive bayes classifier, classification tree, random forest and logistic regression. I also described methods to evaluate the quality of classification models listed above. The third part is a practical one and shows the complete workflow of creating classification model, from data preparation to evaluation of model.

APA, Harvard, Vancouver, ISO, and other styles

18

Kamat, Sai Shyamsunder. "Analyzing Radial Basis Function Neural Networks for predicting anomalies in Intrusion Detection Systems." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-259187.

Full text

Abstract:

In the 21st century, information is the new currency. With the omnipresence of devices connected to the internet, humanity can instantly avail any information. However, there are certain are cybercrime groups which steal the information. An Intrusion Detection System (IDS) monitors a network for suspicious activities and alerts its owner about an undesired intrusion. These commercial IDS’es react after detecting intrusion attempts. With the cyber attacks becoming increasingly complex, it is expensive to wait for the attacks to happen and respond later. It is crucial for network owners to employ IDS’es that preemptively differentiate a harmless data request from a malicious one. Machine Learning (ML) can solve this problem by recognizing patterns in internet traffic to predict the behaviour of network users. This project studies how effectively Radial Basis Function Neural Network (RBFN) with Deep Learning Architecture can impact intrusion detection. On the basis of the existing framework, it asks how well can an RBFN predict malicious intrusive attempts, especially when compared to contemporary detection practices.Here, an RBFN is a multi-layered neural network model that uses a radial basis function to transform input traffic data. Once transformed, it is possible to separate the various traffic data points using a single straight line in extradimensional space. The outcome of the project indicates that the proposed method is severely affected by limitations. E.g. the model needs to be fine tuned over several trials to achieve a desired accuracy. The results of the implementation show that RBFN is accurate at predicting various cyber attacks such as web attacks, infiltrations, brute force, SSH etc, and normal internet behaviour on an average 80% of the time. Other algorithms in identical testbed are more than 90% accurate. Despite the lower accuracy, RBFN model is more than 94% accurate at recording specific kinds of attacks such as Port Scans and BotNet malware. One possible solution is to restrict this model to predict only malware attacks and use different machine learning algorithm for other attacks.
I det 21: a århundradet är information den nya valutan. Med allnärvaro av enheter anslutna till internet har mänskligheten tillgång till information inom ett ögonblick. Det finns dock vissa grupper som använder metoder för att stjäla information för personlig vinst via internet. Ett intrångsdetekteringssystem (IDS) övervakar ett nätverk för misstänkta aktiviteter och varnar dess ägare om ett oönskat intrång skett. Kommersiella IDS reagerar efter detekteringen av ett intrångsförsök. Angreppen blir alltmer komplexa och det kan vara dyrt att vänta på att attackerna ska ske för att reagera senare. Det är avgörande för nätverksägare att använda IDS:er som på ett förebyggande sätt kan skilja på oskadlig dataanvändning från skadlig. Maskininlärning kan lösa detta problem. Den kan analysera all befintliga data om internettrafik, känna igen mönster och förutse användarnas beteende. Detta projekt syftar till att studera hur effektivt Radial Basis Function Neural Networks (RBFN) med Djupinlärnings arkitektur kan påverka intrångsdetektering. Från detta perspektiv ställs frågan hur väl en RBFN kan förutsäga skadliga intrångsförsök, särskilt i jämförelse med befintliga detektionsmetoder.Här är RBFN definierad som en flera-lagers neuralt nätverksmodell som använder en radiell grundfunktion för att omvandla data till linjärt separerbar. Efter en undersökning av modern litteratur och lokalisering av ett namngivet dataset användes kvantitativ forskningsmetodik med prestanda indikatorer för att utvärdera RBFN: s prestanda. En Random Forest Classifier algorithm användes också för jämförelse. Resultaten erhölls efter en serie finjusteringar av parametrar på modellerna. Resultaten visar att RBFN är korrekt när den förutsäger avvikande internetbeteende i genomsnitt 80% av tiden. Andra algoritmer i litteraturen beskrivs som mer än 90% korrekta. Den föreslagna RBFN-modellen är emellertid mycket exakt när man registrerar specifika typer av attacker som Port Scans och BotNet malware. Resultatet av projektet visar att den föreslagna metoden är allvarligt påverkad av begränsningar. T.ex. så behöver modellen finjusteras över flera försök för att uppnå önskad noggrannhet. En möjlig lösning är att begränsa denna modell till att endast förutsäga malware-attacker och använda andra maskininlärnings-algoritmer för andra attacker.

APA, Harvard, Vancouver, ISO, and other styles

19

Thanjavur, Bhaaskar Kiran Vishal. "Automatic generation of hardware Tree Classifiers." Thesis, 2017. https://hdl.handle.net/2144/23688.

Full text

Abstract:

Machine Learning is growing in popularity and spreading across different fields for various applications. Due to this trend, machine learning algorithms use different hardware platforms and are being experimented to obtain high test accuracy and throughput. FPGAs are well-suited hardware platform for machine learning because of its re-programmability and lower power consumption. Programming using FPGAs for machine learning algorithms requires substantial engineering time and effort compared to software implementation. We propose a software assisted design flow to program FPGA for machine learning algorithms using our hardware library. The hardware library is highly parameterized and it accommodates Tree Classifiers. As of now, our library consists of the components required to implement decision trees and random forests. The whole automation is wrapped around using a python script which takes you from the first step of having a dataset and design choices to the last step of having a hardware descriptive code for the trained machine learning model.

APA, Harvard, Vancouver, ISO, and other styles

20

Petrcich, William. "Dimensionality Reduction in the Creation of Classifiers and the Effects of Correlation, Cluster Overlap, and Modelling Assumptions." Thesis, 2011. http://hdl.handle.net/10214/2933.

Full text

Abstract:

Discriminant analysis and random forests are used to create models for classification. The number of variables to be tested for inclusion in a model can be large. The goal of this work was to create an efficient and effective selection program. The first method used was based on the work of others. The resulting models were underperforming, so another approach was adopted. Models were built by adding the variable that maximized new-model accuracy. The two programs were used to generate discriminant-analysis and random forest models for three data sets. An existing software package was also used. The second program outperformed the alternatives. For the small number of runs produced in this study, it outperformed the method that inspired this work. The data sets were studied to identify determinants of performance. No definite conclusions were reached, but the results suggest topics for future study.

APA, Harvard, Vancouver, ISO, and other styles

21

Hricová, Jana. "Metody kontrukce klasifikátorů vhodných pro segmentaci zákazníků." Master's thesis, 2013. http://www.nusl.cz/ntk/nusl-327849.

Full text

Abstract:

Title: Construction of classifiers suitable for segmentation of clients Author: Bc. Jana Hricová Department: Department of Probability and Mathematical Statistics Supervisor: prof. RNDr. Jaromír Antoch, CSc., Department of Probability and Mathematical Statistics Abstract: The master thesis discusses methods that are a part of the data analy- sis, called classification. In the thesis are presented classification methods used to construct tree like classifiers suitable for customer segmentation. Core methodo- logy that is discussed in our thesis is CART (Classification and Regression Trees) and then methodologies around ensemble models that use historical data to cons- truct classification and regression forests, namely Bagging, Boosting, Arcing and Random Forest. Here described methods were applied to real data from the field of customer segmentation and also to simulated data, both processed with RStudio software. Keywords: classification, tree like classifiers, random forests

APA, Harvard, Vancouver, ISO, and other styles

22

WANG, TING-BEN, and 王庭本. "Artificial Bee Colony Algorithm to Construct Non - Random Forest Classifier Ensemble." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/xr22td.

Full text

Abstract:

碩士
華梵大學
資訊管理學系碩士班
105
Data classification method is one of the main tasks of data mining. In the literature, there are many classic base inducers used to train the classifier such as neural network, decision tree…etc., which are all individual classifier. In the past few years, many researches have proposed that the classifier ensemble, which composed by more than one individual classifier, is more effective than any individual classifier of the classifier ensemble. A classifier ensemble is a set of classifiers that are diverse and yet accurate and individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples. RF is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees.This paper would propose an ensemble method（Evolutionary Computation-based Non Random Forests, ECNRF）that uses the Artificial Bee Colony algorithm to encourage the diversity between classifiers by manipulating the train data set. We design an experiment using 15 UCI Repository of machine learning databases to test and verify and then comparing with individual classifier and other classifier ensembles. The result provides that the ECNRF in our experiment has better average accuracy（81.23%）and is significantly difference.

APA, Harvard, Vancouver, ISO, and other styles

23

Liao, Wei-Jie, and 廖偉傑. "Combining Spatiotemporal Background Modeling and Random Forest Classifier for Foreground Segmentation and Shadow Removal." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/21944271977301550985.

Full text

Abstract:

碩士
國立臺灣大學
電機工程學研究所
102
Cast shadows detection and removal is indispensable in the object detection to many surveillance applications. In this paper, we present a novel framework for removing cast shadow of moving objects. Two main components, moving objects detector and redundant shadow remover, are integrated. For moving objects, we adopt the spatiotemporal background extractor (SBE) to detect the moving objects which is comprised of the background extractor (BE) and the background gradient extractor (BGE). SBE features the object detection in the dynamic background and the sudden lighting changes environment. For shadow removal, we use the classifier, Random Forest, to learn the shadow features, which are chromaticity, physical properties, and texture. Then, we remove the shadow from the result of SBE with the shadow classifier. The proposed method can effectively detect the moving objects and remove the shadow effect. Furthermore, we demonstrate the performance of our method compared with some state-of-the-art techniques of object detection and shadow removal.

APA, Harvard, Vancouver, ISO, and other styles

24

Chao, Fu-Mao, and 趙福懋. "A Real-time Dynamic Gesture Recognition System for Basketball Referees Based on a Random Forest Classifier." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/76462921171933275049.

Full text

Abstract:

碩士
國立臺灣科技大學
資訊工程系
105
In recent years, the importance of human-computer interaction has been gradually improved. Gesture interaction behavior has gradually filled our lives, but has also produced many problems and derivatives of a number of technical issues waiting to be resolved. With the rapid evolution and development of science and technology, human beings are using the technology to achieve communication and interaction with the machine. In this paper, we used a single webcam as the medium for image inputs, and established a real-time dynamic gesture recognition system. First, in order to find out the area where the hand and the head are in the images, we segmented the user-defined skin color area set by the system. Then we used the method of geometry calculation to obtain information about the hands and head, providing information to allow the system to interact with the user instantly. Finally, we refined the gesture data by normalization, and then used Random Forest method in machine learning to deal with dynamic gesture data for training and recognizing. We collected the six NBA referee gestures and created a database of five different people. There were 600 gesture data. The experimental results show that the average accuracy of the six gesture is 98.5%. The system we proposed can achieve the performance of real-time recognition, for 640 × 640 size images, and the overall average performance is 30 frames per second.

APA, Harvard, Vancouver, ISO, and other styles

25

Rodríguez, Hernán Cortés. "Ensemble classifiers in remote sensing: a comparative analysis." Master's thesis, 2014. http://hdl.handle.net/10362/11671.

Full text

Abstract:

Dissertation submitted in partial fulfillment of the requirements for the Degree of Master of Science in Geospatial Technologies.
Land Cover and Land Use (LCLU) maps are very important tools for understanding the relationships between human activities and the natural environment. Defining accurately all the features over the Earth's surface is essential to assure their management properly. The basic data which are being used to derive those maps are remote sensing imagery (RSI), and concretely, satellite images. Hence, new techniques and methods able to deal with those data and at the same time, do it accurately, have been demanded. In this work, our goal was to have a brief review over some of the currently approaches in the scientific community to face this challenge, to get higher accuracy in LCLU maps. Although, we will be focus on the study of the classifiers ensembles and the different strategies that those ensembles present in the literature. We have proposed different ensembles strategies based in our data and previous work, in order to increase the accuracy of previous LCLU maps made by using the same data and single classifiers. Finally, only one of the ensembles proposed have got significantly higher accuracy, in the classification of LCLU map, than the better single classifier performance with the same data. Also, it was proved that diversity did not play an important role in the success of this ensemble.

APA, Harvard, Vancouver, ISO, and other styles

26

(6642491), Jingzhao Dai. "SPARSE DISCRETE WAVELET DECOMPOSITION AND FILTER BANK TECHNIQUES FOR SPEECH RECOGNITION." Thesis, 2019.

Find full text

Abstract:

Speech recognition is widely applied to translation from speech to related text, voice driven commands, human machine interface and so on [1]-[8]. It has been increasingly proliferated to Human’s lives in the modern age. To improve the accuracy of speech recognition, various algorithms such as artificial neural network, hidden Markov model and so on have been developed [1], [2].

In this thesis work, the tasks of speech recognition with various classifiers are investigated. The classifiers employed include the support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF) and convolutional neural network (CNN). Two novel features extraction methods of sparse discrete wavelet decomposition (SDWD) and bandpass filtering (BPF) based on the Mel filter banks [9] are developed and proposed. In order to meet diversity of classification algorithms, one-dimensional (1D) and two-dimensional (2D) features are required to be obtained. The 1D features are the array of power coefficients in frequency bands, which are dedicated for training SVM, KNN and RF classifiers while the 2D features are formed both in frequency domain and temporal variations. In fact, the 2D feature consists of the power values in decomposed bands versus consecutive speech frames. Most importantly, the 2D feature with geometric transformation are adopted to train CNN.

Speech recognition including males and females are from the recorded data set as well as the standard data set. Firstly, the recordings with little noise and clear pronunciation are applied with the proposed feature extraction methods. After many trials and experiments using this dataset, a high recognition accuracy is achieved. Then, these feature extraction methods are further applied to the standard recordings having random characteristics with ambient noise and unclear pronunciation. Many experiment results validate the effectiveness of the proposed feature extraction techniques.

APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic 'Random Forests Classifiers'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles