Theses on the topic "ENSEMBLE LEARNING TECHNIQUE"

To see other types of publications on this topic, follow the link: ENSEMBLE LEARNING TECHNIQUE.

Create a correct reference in APA, MLA, Chicago, Harvard, and several other styles


Consult the 23 best theses for your research on the topic "ENSEMBLE LEARNING TECHNIQUE".

Next to each source in the list of references there is an "Add to bibliography" button. Click this button, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication in PDF format and read its abstract online whenever this information is included in the metadata.

Browse theses on a wide variety of disciplines and organize your bibliography correctly.

1

King, Michael Allen. « Ensemble Learning Techniques for Structured and Unstructured Data ». Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/51667.

Full text
Abstract:
This research provides an integrated approach to applying innovative ensemble learning techniques that have the potential to increase the overall accuracy of classification models. Actual structured and unstructured data sets from industry are utilized during the research process, analysis, and subsequent model evaluations. The first research section addresses the consumer demand forecasting and daily capacity management requirements of a nationally recognized alpine ski resort in the state of Utah, in the United States of America. A basic econometric model is developed, and the effectiveness of three classic predictive models is evaluated. These predictive models were subsequently used as input for four ensemble modeling techniques, and the ensemble learning techniques are shown to be effective. The second research section discusses the opportunities and challenges faced by a leading firm providing sponsored search marketing services. The goal of sponsored search marketing campaigns is to create advertising campaigns that better attract and motivate a target market to purchase. This research develops a method for classifying profitable campaigns and maximizing overall campaign portfolio profits. Four traditional classifiers are utilized, along with four ensemble learning techniques, to build classifier models that identify profitable pay-per-click campaigns. A MetaCost ensemble configuration, having the ability to integrate unequal classification costs, produced the highest campaign portfolio profit. The third research section addresses the management challenges of online consumer reviews encountered by service industries and addresses how these textual reviews can be used for service improvements. A service improvement framework is introduced that integrates traditional text mining techniques and second-order feature derivation with ensemble learning techniques. The concept of GLOW and SMOKE words is introduced and is shown to be an objective text-analytic source of service defects or service accolades.
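The MetaCost configuration mentioned above has a compact core idea that a short sketch can illustrate. The following is a minimal, hypothetical Python/scikit-learn reconstruction, not the author's code; the cost matrix values are invented for illustration. MetaCost estimates class probabilities with a bagged ensemble, relabels each training example with the class of minimum expected cost, and retrains an ordinary classifier on the relabelled data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# cost[i, j] = cost of predicting class j when the true class is i (illustrative)
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])  # here, missing class 1 is five times worse

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

# Step 1: a bagged ensemble estimates P(class | x) for every training point
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0).fit(X, y)
proba = bag.predict_proba(X)

# Step 2: relabel each point with the class that minimizes expected cost
expected_cost = proba @ cost          # column j = E[cost | predict j]
y_relabelled = expected_cost.argmin(axis=1)

# Step 3: retrain a single cost-aware classifier on the relabelled data
final = DecisionTreeClassifier(random_state=0).fit(X, y_relabelled)
```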
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
2

Nguyen, Thanh Tien. « Ensemble Learning Techniques and Applications in Pattern Classification ». Thesis, Griffith University, 2017. http://hdl.handle.net/10072/366342.

Full text
Abstract:
It is widely known that the best classifier for a given problem is often problem dependent, and there is no one classification algorithm that is best for all classification tasks. A natural question that arises is: can we combine multiple classification algorithms to achieve higher classification accuracy than a single one? That is the idea behind the class of methods called ensemble methods. An ensemble method is defined as the combination of several classifiers with the aim of achieving a lower classification error rate than a single classifier. Ensemble methods have been applied to various applications, ranging from computer-aided medical diagnosis, computer vision and software engineering to information retrieval. In this study, we focus on heterogeneous ensemble methods, in which a fixed set of diverse learning algorithms is trained on the same training set to generate the different classifiers, and the class prediction is then made based on the output of these classifiers (called Level-1 data or meta-data). The research on heterogeneous ensemble methods is mainly focused on two aspects: (i) proposing efficient methods for combining classifiers on meta-data to achieve high accuracy, and (ii) optimizing the ensemble by performing feature and classifier selection. Although various approaches related to heterogeneous ensemble methods have been proposed, some research gaps still exist. First, in ensemble learning, the meta-data of an observation reflects the agreement and disagreement between the different base classifiers.
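As a rough illustration of the Level-1 data described above, here is a hedged Python/scikit-learn sketch (our own, not the thesis code) that builds meta-data from out-of-fold predictions of a fixed set of diverse base classifiers and trains a combiner on it:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
base_learners = [GaussianNB(), KNeighborsClassifier(),
                 LogisticRegression(max_iter=5000)]

# Level-1 data (meta-data): out-of-fold class probabilities of each base
# learner, so the combiner sees their agreements and disagreements
# without overfitting to in-fold predictions
meta_data = np.hstack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")
    for clf in base_learners
])

combiner = LogisticRegression(max_iter=5000).fit(meta_data, y)
```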
Thesis (PhD Doctorate)
Doctor of Philosophy (PhD)
School of Information and Communication Technology
Science, Environment, Engineering and Technology
Full Text
APA, Harvard, Vancouver, ISO, and other styles
3

Valenzuela, Russell. « Predicting National Basketball Association Game Outcomes Using Ensemble Learning Techniques ». Thesis, California State University, Long Beach, 2019. http://pqdtopen.proquest.com/#viewpdf?dispub=10980443.

Full text
Abstract:

There have been a number of studies that try to predict sporting event outcomes. Most previous research has involved results in football and college basketball. Recent years have seen similar approaches carried out in professional basketball. This thesis attempts to build upon existing statistical techniques and apply them to the National Basketball Association, using a synthesis of algorithms as motivation. A number of ensemble learning methods will be utilized and compared in hopes of improving the accuracy of single models. Individual models used in this thesis will be derived from Logistic Regression, Naïve Bayes, Random Forests, Support Vector Machines, and Artificial Neural Networks, while aggregation techniques include Bagging, Boosting, and Stacking. Data from previous seasons and games, from both players and teams, will be used to train models in R.
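The thesis trains its models in R; purely as an illustration of the comparison it describes, here is a hypothetical scikit-learn sketch that scores a single learner against bagging, boosting, and stacking on the same stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for game data

models = {
    "single (logistic)": LogisticRegression(max_iter=5000),
    "bagging": BaggingClassifier(random_state=0),
    "boosting": AdaBoostClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("nb", GaussianNB()),
                    ("rf", RandomForestClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=5000)),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy of each single/ensemble model
    print(name, cross_val_score(model, X, y, cv=5).mean())
```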

APA, Harvard, Vancouver, ISO, and other styles
4

Johansson, Alfred. « Ensemble approach to code smell identification : Evaluating ensemble machine learning techniques to identify code smells within a software system ». Thesis, Tekniska Högskolan, Jönköping University, JTH, Datateknik och informatik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-49319.

Full text
Abstract:
The need for automated methods for identifying refactoring candidates is prevalent in many software projects today. A symptom of refactoring needs is the presence of code smells within a software system. Recent studies have used single-model machine learning to combat this issue. This study aims to test the possibility of improving machine-learning code smell detection using ensemble methods, thereby identifying the strongest ensemble model in the context of code smells and the relative sensitivity of the strongest performing ensemble identified. The ensemble models' performance was studied by performing experiments using WekaNose to create datasets of code smells and Weka to train and test the models on the datasets. The datasets created were based on the curated Java projects of the Qualitas Corpus. Each tested ensemble method was then compared to all the other ensembles using F-measure, accuracy, and AUC ROC scores. The tested ensemble methods were stacking, voting, bagging, and boosting. The ensemble methods were implemented with models that previous studies had identified as the strongest performers for code smell identification: JRip, J48, Naive Bayes, and SMO. The findings showed that, compared to previous studies, bagging J48 improved results by 0.5%, and that the nominally implemented bagging of J48 in Weka follows best practices yet impacted the model negatively. However, due to the complexity of stacking and voting ensembles, further work is needed on stacking and voting ensemble models in the context of code smell identification.
APA, Harvard, Vancouver, ISO, and other styles
5

Recamonde-Mendoza, Mariana. « Exploring ensemble learning techniques to optimize the reverse engineering of gene regulatory networks ». reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2014. http://hdl.handle.net/10183/95693.

Full text
Abstract:
In this thesis we are concerned with the reverse engineering of gene regulatory networks from post-genomic data, a major challenge in Bioinformatics research. Gene regulatory networks are intricate biological circuits responsible for governing the expression levels (activity) of genes, thereby playing an important role in the control of many cellular processes, including cell differentiation, cell cycle and metabolism. Unveiling the structure of these networks is crucial to gain a systems-level understanding of organisms' development and behavior, and eventually shed light on the mechanisms of diseases caused by the deregulation of these cellular processes. Due to the increasing availability of high-throughput experimental data and the large dimension and complexity of biological systems, computational methods have been essential tools in enabling this investigation. Nonetheless, their performance is much deteriorated by important computational and biological challenges posed by the scenario. In particular, the noisy and sparse features of biological data turn the network inference into a challenging combinatorial optimization problem, in which current methods fail with respect to the accuracy and robustness of predictions. This thesis aims at investigating the use of ensemble learning techniques as a means to overcome current limitations and enhance the inference process by exploiting the diversity among multiple inferred models. To this end, we develop computational methods both to generate diverse network predictions and to combine multiple predictions into an ensemble solution, and apply this approach to a number of scenarios with different sources of diversity in order to understand its potential in this specific context. We show that the proposed solutions are competitive with traditional algorithms in the field and improve our capacity to accurately reconstruct gene regulatory networks. Results obtained for the inference of transcriptional and post-transcriptional regulatory networks, two adjacent and complementary layers of the overall gene regulatory network, evidence the efficiency and robustness of our approach, encouraging the consolidation of ensemble systems as a promising methodology to decipher the structure of gene regulatory networks.
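One simple way to combine diverse network predictions into a single ensemble solution, in the spirit described above, is rank averaging of edge-confidence scores. The sketch below is our own hypothetical illustration in Python, not the author's method; the scores are synthetic:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
n_methods, n_edges = 3, 10
# rows: hypothetical edge-confidence scores from three inference methods
scores = rng.random((n_methods, n_edges))

# rank edges within each method (rank 1 = most confident edge)...
ranks = np.vstack([rankdata(-s) for s in scores])
# ...then average the ranks to obtain the consensus (ensemble) prediction
consensus_rank = ranks.mean(axis=0)
top_edges = np.argsort(consensus_rank)  # edges ordered by ensemble confidence
```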
APA, Harvard, Vancouver, ISO, and other styles
6

Luong, Vu A. « Advanced techniques for classification of non-stationary streaming data and applications ». Thesis, Griffith University, 2022. http://hdl.handle.net/10072/420554.

Full text
Abstract:
Today we are going through Industry 4.0, where not only are people connected via social networks, but an enormous number of electronic devices are also connected via the Internet of Things (IoT). With the rapid development of modern technologies like blockchains, 5G, computing chips and software infrastructures, people and application programs interact with each other at a very fast pace. As a result, massive amounts of data are generated in real time, posing many interesting but challenging problems to the machine learning community. According to the International Data Corporation [73], the total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching more than 180 zettabytes by 2025, approximately 90 times the data volume of 2010. However, only a minor subset of this newly created data is saved, as only about two percent of the data created and consumed in 2020 was retained into 2021. One of the reasons for this is that data from numerous real-world applications, including sensor networks, video services, event logs, and traffic monitoring systems, are often collected in the form of unbounded data streams, which would eventually exceed physical storage. Therefore, the two major concerns of unlimited data volume and fast velocity remain unsolved when dealing with data streams. Traditional offline machine learning methods have successfully solved many intelligence tasks in recent years, most notably in computer vision and natural language processing. However, they are not efficient when dealing with data streams generated dynamically in real time by the above-mentioned applications. In particular, the offline learning paradigm suffers from many limitations in this context: (1) it is not practical to store the entire data stream in memory; (2) traditional algorithms need to be retrained when new training data instances become available; and (3) the slow training time makes it almost impossible for them to adapt instantly to real-time data. In this study, we focus on developing new ensemble methods to solve the problems of data stream classification. Although several studies related to ensemble learning have been proposed in the literature, some research gaps still exist. First, most ensemble algorithms developed for evolving data streams are homogeneous, which means that all the base classifiers are generated from the same learning algorithm, most frequently Hoeffding Trees. The data stream literature lacks heterogeneous ensembles, which benefit from having much fewer base learners while obtaining prediction performance comparable to homogeneous ensembles. Therefore, we introduce the HEterogeneous Ensemble Selection (HEES) method, which dynamically determines an appropriate subset of base learners to make predictions for non-stationary data streams. Though HEES uses only 8 base classifiers, our experiments on 50 datasets show that its prediction accuracy is higher than that of homogeneous ensembles with 40 base classifiers, including OzaBagAdwin, OzaBoostAdwin, BOLE, and LNSE. Second, most existing models in the literature have low expressive capability; hence, we propose a Streaming Deep Forest (SDF) method to fill this gap. An active learning strategy is also introduced to save label query costs and to speed up SDF. As a result, SDF obtains state-of-the-art accuracy in both immediate and delayed settings.
Next, we present a multi-layer heterogeneous ensemble called SMiLE and a selection method to tackle the real-world problem of insect stream classification. In our experiments, SMiLE achieves the best performance on 10 of 11 datasets and the second-best performance on the remaining dataset in comparison to benchmark algorithms. Finally, we propose an incremental framework to combine different segmentation models for medical images. The proposed framework is about 16 times faster than the second-fastest method, MLR, on both the CVC_ColonDB and MICCAI2015 datasets in our experiments.
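A dynamic selection scheme of the kind HEES implements can be sketched, in heavily simplified form, with incremental scikit-learn learners. This is an assumption-laden toy version (a sliding window of per-learner accuracy as the selection signal), not the published algorithm:

```python
import numpy as np
from collections import deque
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3000, random_state=0)  # stand-in stream
classes = np.unique(y)
learners = [GaussianNB(), SGDClassifier(random_state=0)]
windows = [deque(maxlen=200) for _ in learners]  # recent per-learner correctness

for i, (x_i, y_i) in enumerate(zip(X, y)):
    x_i = x_i.reshape(1, -1)
    if i > 50:  # once every learner has seen some data
        # select the base learners that score best on the recent window...
        accs = [np.mean(w) if w else 0.0 for w in windows]
        chosen = [l for l, a in zip(learners, accs) if a >= max(accs) - 0.05]
        # ...and combine their votes into the ensemble prediction
        votes = [int(clf.predict(x_i)[0]) for clf in chosen]
        y_hat = max(set(votes), key=votes.count)
    # test-then-train: score each learner on the new point, then update it
    for clf, w in zip(learners, windows):
        if i > 50:
            w.append(int(clf.predict(x_i)[0] == y_i))
        clf.partial_fit(x_i, [y_i], classes=classes)
```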
Thesis (PhD Doctorate)
Doctor of Philosophy (PhD)
School of Info & Comm Tech
Science, Environment, Engineering and Technology
Full Text
APA, Harvard, Vancouver, ISO, and other styles
7

Wang, Xian Bo. « A novel fault detection and diagnosis framework for rotating machinery using advanced signal processing techniques and ensemble extreme learning machines ». Thesis, University of Macau, 2018. http://umaclib3.umac.mo/record=b3951596.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Etienam, Clement. « Structural and shape reconstruction using inverse problems and machine learning techniques with application to hydrocarbon reservoirs ». Thesis, University of Manchester, 2019. https://www.research.manchester.ac.uk/portal/en/theses/structural-and-shape-reconstruction-using-inverse-problems-and-machine-learning-techniques-with-application-to-hydrocarbon-reservoirs(e21f1030-64e7-4267-b708-b7f0165a5f53).html.

Full text
Abstract:
This thesis introduces novel ideas in subsurface reservoir model calibration, known as history matching in the reservoir engineering community. The target of history matching is to match historical pressure and production data from the producing wells with the output of the reservoir simulator, for the sole purpose of reducing uncertainty in such models and improving confidence in production forecasts. Ensemble-based methods such as the Ensemble Kalman Filter (EnKF) and Ensemble Smoother with Multiple Data Assimilation (ES-MDA) have been proposed for history matching in the literature. EnKF/ES-MDA is a Monte Carlo, ensemble-based filter in which the representation of the covariance is located at the mean of the ensemble of the distribution instead of the uncertain true model. In EnKF/ES-MDA, calculation of gradients is not required, and the mean of the ensemble of realisations provides the best estimate, with the ensemble on its own estimating the probability density. However, because of the inherent assumptions of linearity and Gaussianity of the petrophysical property distribution, EnKF/ES-MDA does not provide an acceptable history match and characterisation of uncertainty when tasked with calibrating reservoir models with channel-like structures. One of the novel methods introduced in this thesis combines successive parameter and shape reconstruction using level-set functions (EnKF/ES-MDA-level set), where the indicator functions of the spatial permeability fields are transformed into signed distances. These signed-distance functions (better suited to the Gaussian requirement of EnKF/ES-MDA) are then updated during the EnKF/ES-MDA inversion. The method outperforms standard EnKF/ES-MDA in retaining the geological realism of channels during and after history matching, and also yields a lower root-mean-square (RMS) misfit compared to standard EnKF/ES-MDA. To improve on the petrophysical reconstruction attained with the EnKF/ES-MDA-level set technique, a novel parametrisation incorporating an unsupervised machine learning method for the recovery of the permeability and porosity fields is developed. The permeability and porosity fields are posed as a sparse field-recovery problem, and a novel SELE (Sparsity-Ensemble optimization-Level-set Ensemble optimisation) approach is proposed for the history matching. In SELE, some realisations are learned using K-means clustering Singular Value Decomposition (K-SVD) to generate an overcomplete codebook or dictionary. This dictionary is combined with Orthogonal Matching Pursuit (OMP) to ease the ill-posed nature of the production data inversion, converting our permeability/porosity field into a sparse domain. SELE enforces prior structural information on the model during the history matching and reduces the computational complexity of the Kalman gain matrix, leading to faster attainment of the minimum of the cost function. As shown in the thesis, SELE outperforms conventional EnKF/ES-MDA in matching the historical production data, evident in the lower RMS value and a high geological realism/similarity to the true reservoir model.
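For readers unfamiliar with ES-MDA, one assimilation step has a compact closed form. The numpy sketch below is a generic textbook version under standard assumptions (inflation coefficient alpha, observation-error covariance C_e), not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def es_mda_step(M, D, d_obs, C_e, alpha):
    """One ES-MDA update. M: (n_param, n_ens) parameter ensemble,
    D: (n_obs, n_ens) simulated data, d_obs: (n_obs,) observations."""
    n_ens = M.shape[1]
    dM = M - M.mean(axis=1, keepdims=True)
    dD = D - D.mean(axis=1, keepdims=True)
    C_md = dM @ dD.T / (n_ens - 1)                  # param/data cross-covariance
    C_dd = dD @ dD.T / (n_ens - 1)                  # data auto-covariance
    K = C_md @ np.linalg.inv(C_dd + alpha * C_e)    # Kalman-like gain
    # observations perturbed with inflated noise, one draw per ensemble member
    noise = rng.multivariate_normal(np.zeros(len(d_obs)), alpha * C_e,
                                    size=n_ens).T
    return M + K @ (d_obs[:, None] + noise - D)
```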
APA, Harvard, Vancouver, ISO, and other styles
9

Taylor, Farrell R. « Evaluation of Supervised Machine Learning for Classifying Video Traffic ». NSUWorks, 2016. http://nsuworks.nova.edu/gscis_etd/972.

Full text
Abstract:
Operational deployment of machine learning based classifiers in real-world networks has become an important area of research to support automated real-time quality of service decisions by Internet service providers (ISPs) and more generally, network administrators. As the Internet has evolved, multimedia applications, such as voice over Internet protocol (VoIP), gaming, and video streaming, have become commonplace. These traffic types are sensitive to network perturbations, e.g. jitter and delay. Automated quality of service (QoS) capabilities offer a degree of relief by prioritizing network traffic without human intervention; however, they rely on the integration of real-time traffic classification to identify applications. Accordingly, researchers have begun to explore various techniques to incorporate into real-world networks. One method that shows promise is the use of machine learning techniques trained on sub-flows – a small number of consecutive packets selected from different phases of the full application flow. Generally, research on machine learning classifiers was based on statistics derived from full traffic flows, which can limit their effectiveness (recall and precision) if partial data captures are encountered by the classifier. In real-world networks, partial data captures can be caused by unscheduled restarts/reboots of the classifier or data capture capabilities, network interruptions, or application errors. Research on the use of machine learning algorithms trained on sub-flows to classify VoIP and gaming traffic has shown promise, even when partial data captures are encountered. This research extends that work by applying machine learning algorithms trained on multiple sub-flows to classification of video streaming traffic. Results from this research indicate that sub-flow classifiers have much higher and more consistent recall and precision than full flow classifiers when applied to video traffic. Moreover, the application of ensemble methods, specifically Bagging and adaptive boosting (AdaBoost) further improves recall and precision for sub-flow classifiers. Findings indicate sub-flow classifiers based on AdaBoost in combination with the C4.5 algorithm exhibited the best performance with the most consistent results for classification of video streaming traffic.
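The best-performing combination reported above, AdaBoost over C4.5, can be approximated in scikit-learn; since C4.5 itself is not available there, an entropy-criterion tree is the usual stand-in. A hedged sketch (the feature matrix name is hypothetical):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# C4.5 is not in scikit-learn; an entropy-based CART is a common approximation
clf = AdaBoostClassifier(
    DecisionTreeClassifier(criterion="entropy", max_depth=3),
    n_estimators=100, random_state=0)
# clf.fit(X_subflows, y_traffic_class)  # hypothetical sub-flow features/labels
```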
APA, Harvard, Vancouver, ISO, and other styles
10

Vandoni, Jennifer. « Ensemble Methods for Pedestrian Detection in Dense Crowds ». Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS116/document.

Full text
Abstract:
This study deals with pedestrian detection in high-density crowds from a mono-camera system. The detections can then be used both to obtain robust density estimation and to initialize a tracking algorithm. One of the most difficult challenges is that usual pedestrian detection methodologies do not scale well to high-density crowds, for reasons such as the absence of background, high visual homogeneity, the small size of the objects, and heavy occlusions. We cast the detection problem as a Multiple Classifier System (MCS), composed of two different ensembles of classifiers, the first based on SVMs (SVM-ensemble) and the second based on CNNs (CNN-ensemble), combined within Belief Function Theory (BFT) to exploit their strengths for pixel-wise classification. The SVM-ensemble is composed of several SVM detectors based on different gradient, texture, and orientation descriptors, able to tackle the problem from different perspectives. BFT allows us to take into account imprecision in addition to the uncertainty value provided by each classifier, which we consider to come from possible errors in the calibration procedure and from pixel-neighborhood heterogeneity in the image space. However, the scarcity of labeled data for specific dense-crowd contexts makes it impossible to obtain robust training and validation sets. By exploiting belief functions directly derived from the classifiers' combination, we propose an evidential Query-by-Committee (QBC) active learning algorithm to automatically select the most informative training samples. On the other side, we explore deep learning techniques by casting the problem as a segmentation task with soft labels, with a fully convolutional network designed to recover small objects thanks to a tailored use of dilated convolutions. In order to obtain a pixel-wise measure of reliability for the network's predictions, we create a CNN-ensemble by means of dropout at inference time, and we combine the different obtained realizations in the context of BFT. Finally, we show that the output map given by the MCS can be employed to perform people counting. We propose an evaluation method that can be applied at every scale, also providing uncertainty bounds on the estimated density.
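The CNN-ensemble obtained "by means of dropout at inference time" refers to Monte-Carlo dropout. A generic PyTorch sketch of the idea (our illustration, not the thesis code):

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Keep dropout active at inference and average several stochastic passes;
    the spread across passes serves as a per-class reliability measure."""
    model.train()  # enables dropout layers (no optimizer step is taken here)
    with torch.no_grad():
        preds = torch.stack([torch.softmax(model(x), dim=1)
                             for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```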
APA, Harvard, Vancouver, ISO, and other styles
11

Pereira, Vinicius Gomes. « Using supervised machine learning and sentiment analysis techniques to predict homophobia in portuguese tweets ». reponame:Repositório Institucional do FGV, 2018. http://hdl.handle.net/10438/24301.

Full text
Abstract:
This work studies the identification of homophobic tweets using a natural language processing and machine learning approach. The goal is to construct a predictive model that can detect, with reasonable accuracy, whether a tweet contains content offensive to LGBT people or not. The database used to train the predictive models was constructed by aggregating tweets from users who have interacted with politicians and/or political parties in Brazil. Tweets containing LGBT-related terms or references to openly LGBT individuals were collected and manually classified. A large part of this work lies in constructing features that accurately capture not only the text of the tweet but also specific characteristics of the users and their language choices. In particular, the use of swear words and strong vocabulary is a quite strong predictor of offensive tweets. Naturally, n-grams and term weighting schemes were also considered as features of the model. A total of 12 sets of features were constructed. A broad range of machine learning techniques was employed in the classification task: naive Bayes, regularized logistic regressions, feedforward neural networks, extreme gradient boosting (XGBoost), random forest and support vector machines. After estimating and tuning each model, they were combined using voting and stacking. Voting using 10 models obtained the best result, with 89.42% accuracy.
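The winning voting combination corresponds to what scikit-learn calls soft voting; a reduced, hypothetical three-model sketch (the thesis used ten tuned models, and the feature-matrix names below are invented):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# scaled-down stand-in for the ten tuned models combined in the thesis
voting = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("logit", LogisticRegression(max_iter=5000)),
                ("rf", RandomForestClassifier(random_state=0))],
    voting="soft")  # averages predicted probabilities across the models
# voting.fit(X_tweet_features, y_offensive)  # hypothetical features/labels
```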
APA, Harvard, Vancouver, ISO, and other styles
12

Bui, Minh Thanh. « Statistical modeling, level-set and ensemble learning for automatic segmentation of 3D high-frequency ultrasound data : towards expedited quantitative ultrasound in lymph nodes from cancer patients ». Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066146/document.

Full text
Abstract:
This work investigates approaches to obtain automatic segmentation of three media (i.e., lymph node parenchyma, perinodal fat, and normal saline) in lymph node (LN) envelope data, to expedite quantitative ultrasound (QUS) in dissected LNs from cancer patients. A statistical modeling study identified a two-parameter gamma distribution as the best model for data from the three media, based on its high fitting accuracy, its analytically less-complex probability density function (PDF), and closed-form expressions for its parameter estimation. Two novel level-set segmentation methods that made use of localized statistics of envelope data to handle data inhomogeneities caused by attenuation and focusing effects were developed. The first, local region-based gamma distribution fitting (LRGDF), employed the gamma PDFs to model speckle statistics of envelope data in local regions at a controllable scale, using a smooth function with compact support. The second, statistical transverse-slice-based level-set (STS-LS), used gamma PDFs to locally model speckle statistics in consecutive transverse slices. A novel method was then designed and evaluated to automatically initialize the LRGDF and STS-LS methods using random forest classification with newly proposed features. Methods developed in this research provided accurate, automatic, and efficient segmentation results on simulated envelope data and on data acquired from LNs of colorectal- and breast-cancer patients, as compared with manual expert segmentation. Results also demonstrated that accurate QUS estimates are maintained when automatic segmentation is applied to evaluate excised LN data.
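The "closed-form expressions for its parameter estimation" mentioned above are, for instance, the moment estimators of the two-parameter gamma distribution. A small numpy sketch on synthetic envelope samples (illustrative only, not the thesis pipeline):

```python
import numpy as np
from scipy import stats

# synthetic stand-in for ultrasound envelope samples from one medium
env = stats.gamma.rvs(a=2.0, scale=1.5, size=10_000, random_state=0)

mean, var = env.mean(), env.var()
k_hat = mean ** 2 / var      # shape parameter, from the first two moments
theta_hat = var / mean       # scale parameter
```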
APA, Harvard, Vancouver, ISO, and other styles
13

Pacheco, Do Espirito Silva Caroline. « Feature extraction and selection for background modeling and foreground detection ». Thesis, La Rochelle, 2017. http://www.theses.fr/2017LAROS005/document.

Full text
Abstract:
In this thesis, we present a robust descriptor for background subtraction which is able to describe texture from an image sequence. The descriptor is less sensitive to noisy pixels and produces a short histogram, while preserving robustness to illumination changes. A descriptor for dynamic texture recognition is also proposed; it extracts not only color information but also more detailed information from video sequences. Finally, we present an ensemble-based feature selection approach that is able to select suitable features for each pixel to distinguish foreground objects from the background. Our proposal uses a mechanism to update the relative importance of each feature over time. For this purpose, a heuristic approach is used to reduce the complexity of background model maintenance while preserving the robustness of the background model. However, this method only reaches its highest accuracy when the number of features is huge. In addition, each base classifier learns a feature set instead of individual features. To overcome these limitations, we extended our previous approach by proposing a new methodology for selecting features based on wagging. We also adopted a superpixel-based approach instead of a pixel-level approach. This not only increases efficiency in terms of time and memory consumption, but also improves the segmentation performance for moving objects.
APA, Harvard, Vancouver, ISO, and other styles
14

Bahri, Maroua. « Improving IoT data stream analytics using summarization techniques ». Electronic Thesis or Diss., Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAT017.

Full text
Abstract:
With the evolution of technology, the use of smart Internet-of-Things (IoT) devices, sensors, and social networks results in an overwhelming volume of IoT data streams, generated daily from several applications, that can be transformed into valuable information through machine learning tasks. In practice, multiple critical issues arise in order to extract useful knowledge from these evolving data streams, mainly that the stream needs to be handled and processed efficiently. In this context, this thesis aims to improve the performance (in terms of memory and time) of existing data mining algorithms on streams. We focus on the classification task in the streaming framework, which is challenging principally because of the high, and increasing, data dimensionality, in addition to the potentially infinite amount of data; these two aspects make the classification task harder. The first part of the thesis surveys the current state of the art of classification and dimensionality reduction techniques as applied to the stream setting, providing an updated view of the most recent works in this vibrant area. In the second part, we detail our contributions to the field of classification in streams by developing novel approaches based on summarization techniques, aiming to reduce the computational resources of existing classifiers with no, or minor, loss of classification accuracy. To address high-dimensional data streams and make classifiers efficient, we incorporate an internal preprocessing step that reduces the dimensionality of input data incrementally before feeding them to the learning stage. We present several approaches applied to several classification tasks: Naive Bayes enhanced with sketches and the hashing trick, k-NN using compressed sensing and UMAP, and their integration into ensemble methods.
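As a toy illustration of the "Naive Bayes enhanced with sketches and the hashing trick" idea (our own assumption-laden sketch, not the thesis algorithm), incoming text items can be hashed to a fixed low dimension before incremental learning:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
clf = MultinomialNB()

stream = [(["sensor overheating alarm"], [1]),   # hypothetical mini-batches
          (["routine status report"], [0])]
for texts, labels in stream:
    X = vec.transform(texts)       # fixed-size representation, arrival by arrival
    clf.partial_fit(X, labels, classes=[0, 1])
```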
APA, Harvard, Vancouver, ISO, and other styles
15

NIGAM, HARSHIT. « SOFTWARE DEFECT PREDICTION USING ENSEMBLE OF MACHINE LEARNING TECHNIQUE ». Thesis, 2020. http://dspace.dtu.ac.in:8080/jspui/handle/repository/18102.

Full text
Abstract:
Research on software defect prediction has attracted many researchers, since it helps in the production of effective software. An additional advantage is that it helps reduce software development costs and supports strategies for identifying the degree of defect-proneness of software in the future. For specific kinds of machine learning, there is no convincing evidence as to which will be more efficient and accurate in predicting software defects. Some of the previous related work, however, suggests ensemble learning strategies as a more accurate alternative. This work applies the resample method with four kinds of ensemble learners (boosting, bagging, stacking, and voting), using four base learners on different versions of the same dataset provided in the PROMISE repository. Results show that accuracy is improved using ensemble strategies rather than single learners.
APA, Harvard, Vancouver, ISO, and other styles
16

YADAV, MAYANK. « USE OF ENSEMBLE LEARNERS TO PREDICT NUMBER OF DEFECTS IN A SOFTWARE ». Thesis, 2023. http://dspace.dtu.ac.in:8080/jspui/handle/repository/19838.

Full text
Abstract:
Fault detection is presently crucial in industry: early discovery of faults may aid in the prevention of subsequent abnormal events, and methods for finding flaws faster than the customary time restriction are necessary. Fault detection can be achieved in a variety of ways, and this research goes through the fundamental approaches. Detection methods include data- and signal-based approaches, process-model-based methods, and knowledge-based methods; some of them need very precise models. Early issue discovery increases life expectancy, enhances safety, and lowers maintenance costs. When choosing a fault detection system, several factors must be considered. Principal Component Analysis can help find flaws in large-scale systems, while signal models are used when difficulties arise as a result of process changes. This research includes a systematic review of the literature, along with a selection of noteworthy applications, and goes through different real-world scenarios that employ different defect detection methodologies; in other words, it looks at both hardware and software concerns. The first case considers fault detection, where a decision tree technique is utilized to detect defective lines; the algorithm is designed to categorize lines as faulty or non-faulty whenever possible. In the second scenario, the ensemble learning technique is employed to discover faults in each software dataset. During testing, software exhibits occurrences of multiple defects, some capable of causing immediate failures and thereby decreasing the software's capability.
APA, Harvard, Vancouver, ISO, and other styles
17

JAWA, MISHA. « COMPARISION OF ENSEMBLE LEARNING MODELS AND IMPACT OF DATA BALANCING TECHNIQUE FOR SOFTWARE EFFORT ESTIMATION ». Thesis, 2022. http://dspace.dtu.ac.in:8080/jspui/handle/repository/19229.

Full text
Abstract:
Project management is a critical component of every software project's success. Estimating the cost and effort of software development at the outset of the project is one of the most important responsibilities in software project management, as estimating effort allows project managers to manage resources and activities more effectively. The primary purpose of this study was to construct and compare the usage of two common ensemble approaches (bagging and boosting) to improve estimator accuracy, and to study the impact of the Synthetic Minority Over-Sampling Technique for Regression (SMOTER) on effort estimation using machine learning algorithms. Random forest, support vector regression, elastic net, decision tree regressor, linear regression, lasso regression, and ridge regression are the machine learning techniques we implemented. For our study we used the Albrecht, China, COCOMO81, Desharnais, and Maxwell datasets. We also performed feature selection and considered only those features that have a strong correlation with the target feature, i.e., effort. Results on the two performance metrics, Mean Magnitude of Relative Error (MMRE) and PRED(25), demonstrate that utilising elastic net as the base learner for AdaBoost outperforms the other models, and that there is a significant decrease in the error of each model after applying SMOTER.
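SMOTER adapts SMOTE to regression by interpolating both the features and the target between a rare example and one of its neighbours. A minimal numpy sketch of that core step, under our own simplifying assumptions (single synthetic pair, inverse-distance target weighting):

```python
import numpy as np

rng = np.random.default_rng(0)

def smoter_sample(x_i, y_i, x_nn, y_nn):
    """Create one synthetic (x, y) pair between a rare case and a neighbour."""
    u = rng.random()
    x_new = x_i + u * (x_nn - x_i)
    # target weighted inversely to the distance from each parent
    d_i = np.linalg.norm(x_new - x_i)
    d_nn = np.linalg.norm(x_new - x_nn)
    y_new = (d_nn * y_i + d_i * y_nn) / (d_i + d_nn + 1e-12)
    return x_new, y_new

x_new, y_new = smoter_sample(np.array([1.0, 2.0]), 10.0,
                             np.array([2.0, 3.0]), 14.0)
```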
APA, Harvard, Vancouver, ISO, and other styles
18

Dolo, Kgaugelo Moses. « Differential evolution technique on weighted voting stacking ensemble method for credit card fraud detection ». Diss., 2019. http://hdl.handle.net/10500/26758.

Full text
Abstract:
Differential Evolution is a stochastic, population-based optimization technique that is powerful and efficient over a continuous space for solving differentiable and non-linear optimization problems. The weighted voting stacking ensemble method is an important technique that combines various classifier models. However, selecting the appropriate weights of classifier models for the correct classification of transactions is a problem. This research study is therefore aimed at exploring whether the Differential Evolution optimization method is a good approach for defining the weighting function. Manual and random selection of weights for voting on credit card transactions has previously been carried out; however, a large number of fraudulent transactions were not detected by the classifier models, which means that a technique to overcome the weaknesses of the classifier models is required. Thus, the problem of selecting the appropriate weights was viewed as a weight optimization problem in this study. The dataset was downloaded from the Kaggle competition data repository. Various machine learning algorithms were used to weight-vote the class of a transaction, and the differential evolution optimization technique was used as the weighting function. In addition, the Synthetic Minority Oversampling Technique (SMOTE) and Safe-Level Synthetic Minority Oversampling Technique (SL-SMOTE) oversampling algorithms were modified to preserve the definition of SMOTE while improving performance. Results generated from this research study showed that the Differential Evolution optimization method is a good weighting function, which can be adopted as a systematic weighting function for the weighted voting stacking ensemble method of various classification methods.
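The core idea, casting weight selection as optimization over a continuous space, can be sketched with scipy's differential_evolution. This is a hedged toy version on synthetic imbalanced data, not the dissertation's setup (the objective and models are our own choices):

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

models = [LogisticRegression(max_iter=5000), GaussianNB(),
          DecisionTreeClassifier(max_depth=5, random_state=0)]
probas = [m.fit(X_tr, y_tr).predict_proba(X_val) for m in models]

def neg_score(w):
    w = w / (w.sum() + 1e-12)                        # normalise voting weights
    blend = sum(wi * p for wi, p in zip(w, probas))  # weighted soft vote
    return -(blend.argmax(axis=1) == y_val).mean()   # recall or F1 also fit here

result = differential_evolution(neg_score, bounds=[(0.0, 1.0)] * len(models),
                                seed=0)
best_weights = result.x / result.x.sum()
```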
School of Computing
M. Sc. (Computing)
APA, Harvard, Vancouver, ISO, and other styles
19

Pisani, Francesco Sergio, Felice Crupi et Gianluigi Folino. « Ensemble learning techniques for cyber security applications ». Thesis, 2017. http://hdl.handle.net/handle/10955/1873.

Full text
Abstract:
Doctoral programme (Dottorato di Ricerca) in Information and Communication Engineering for Pervasive Intelligent Environments, Cycle XXIX
Cyber security involves protecting information and systems from major cyber threats; frequently, high-level techniques, such as data mining techniques, are used to efficiently fight, alleviate the effects of, or prevent the actions of cybercriminals. In particular, classification can be used efficiently for many cyber security applications, e.g. in intrusion detection systems, in the analysis of user behavior, and in risk and attack analysis. However, the complexity and the diversity of modern systems have opened a wide range of new issues that are difficult to address. In fact, security software has to deal with missing data, privacy limitations, and heterogeneous sources. Therefore, it is really unlikely that a single classification algorithm will perform well for all types of data, especially in the presence of changes and under real-time and scalability constraints. To this aim, this thesis proposes a framework based on the ensemble paradigm to cope with these problems. An ensemble is a learning paradigm where multiple learners are trained for the same task by a learning algorithm, and the predictions of the learners are combined for dealing with new unseen instances. The ensemble method helps to reduce the variance of the error, the bias, and the dependence on a single dataset; furthermore, it can be built in an incremental way and is apt for distributed implementations. It is also particularly suitable for distributed intrusion detection, because it permits building a network profile by combining different classifiers that together provide complementary information. However, the phase of building the ensemble can be computationally expensive, since when new data arrive it is necessary to restart the training phase. For this reason, the framework is based on Genetic Programming to evolve a function for combining the classifiers composing the ensemble, which has some attractive characteristics. First, the models composing the ensemble can be trained only on a portion of the training set, and then combined and used without any extra phase of training. Moreover, the models can be specialized for a single class, and they can be designed to handle the difficult problems of unbalanced classes and missing data. In case of changes in the data, the function can be recomputed incrementally, with moderate computational effort, and in a streaming environment drift strategies can be used to update the models. In addition, all the phases of the algorithm are distributed and can exploit the advantages of running on parallel/distributed architectures to cope with real-time constraints. The framework is oriented and specialized towards cyber security applications. For this reason, the algorithm is designed to work with missing data, unbalanced classes, models specialized on specific tasks, and models working with streaming data. Two typical scenarios in the cyber security domain are considered, and experiments are conducted on artificial and real datasets to test the effectiveness of the approach. The first scenario deals with user behavior: the actions taken by users could lead to data breaches, and the damage could have a very high cost. The second scenario deals with intrusion detection systems; in this research area, the ensemble paradigm is a very new technique, and researchers must completely understand the advantages of this solution.
Università della Calabria
APA, Harvard, Vancouver, ISO, and other styles
20

NIGAM, HARSHIT. « SOFTWARE DEFECT PREDICTION USING ENSEMBLE OF MACHINE LEARNING TECHNIQUES ». Thesis, 2020. http://dspace.dtu.ac.in:8080/jspui/handle/repository/18103.

Full text
Abstract:
Research on software defect prediction has attracted many researchers, since it helps in the production of effective software. An additional advantage is that it helps reduce software development costs and supports strategies for identifying the degree of defect-proneness of software in the future. For specific kinds of machine learning, there is no convincing evidence as to which will be more efficient and accurate in predicting software defects. Some of the previous related work, however, suggests ensemble learning strategies as a more accurate alternative. This work applies the resample method with four kinds of ensemble learners (boosting, bagging, stacking, and voting), using four base learners on different versions of the same dataset provided in the PROMISE repository. Results show that accuracy is improved using ensemble strategies rather than single learners.
APA, Harvard, Vancouver, ISO, and other styles
21

Amaro, Miguel Mendes. « Credit scoring : comparison of non‐parametric techniques against logistic regression ». Master's thesis, 2020. http://hdl.handle.net/10362/99692.

Full text
Abstract:
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence
Over the past decades, financial institutions have been giving increased importance to credit risk management as a critical tool to control their profitability. More than ever, it has become crucial for these institutions to be able to discriminate well between good and bad clients, so as to accept only the credit applications that are not likely to default. To calculate the probability of default of a particular client, most financial institutions have credit scoring models based on parametric techniques. Logistic regression is the current industry-standard technique in credit scoring models, and it is one of the techniques under study in this dissertation. Although it is regarded as a robust and intuitive technique, it is still not free from several criticisms of the model assumptions it makes, which can compromise its predictions. This dissertation intends to evaluate the gains in performance resulting from using more modern non-parametric techniques instead of logistic regression, performing a model comparison over four different real-life credit datasets. Specifically, the techniques compared against logistic regression in this study consist of two single classifiers (decision tree and SVM with RBF kernel) and two ensemble methods (random forest and stacking with cross-validation). The literature review demonstrates that heterogeneous ensemble approaches have a weaker presence in credit scoring studies and, because of that, stacking with cross-validation was considered in this study. The results demonstrate that logistic regression outperforms the decision tree classifier, has similar performance to the SVM, and slightly underperforms both ensemble approaches to a similar extent.
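The models compared here map directly onto scikit-learn's API; as an illustration under that assumption (not the dissertation's code), the cross-validated stacked model alone could look like this:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=5000),
    cv=5)  # out-of-fold predictions feed the meta-learner
# stack.fit(X_credit, y_default)  # hypothetical credit dataset
```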
Styles APA, Harvard, Vancouver, ISO, etc.
22

Reichenbach, Jonas. « Credit scoring with advanced analytics : applying machine learning methods for credit risk assessment at the Frankfurter sparkasse ». Master's thesis, 2018. http://hdl.handle.net/10362/49557.

Texte intégral
Résumé :
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management
The need for controlling and managing credit risk obliges financial institutions to constantly reconsider their credit scoring methods. In recent years, machine learning has shown improvement over the common traditional methods for credit scoring. Even small improvements in prediction quality are of great interest to financial institutions. In this thesis, classification methods are applied to the credit data of the Frankfurter Sparkasse to score its credits. Since recent research has shown that ensemble methods deliver outstanding prediction quality for credit scoring, the focus of the model investigation and application is set on such methods. Additionally, the typically imbalanced class distribution of credit scoring datasets motivates the use of sampling techniques that compensate for the imbalance in the training dataset. We evaluate and compare different types of models and techniques according to defined metrics. Besides delivering high prediction quality, the model's outcome should be interpretable as default probabilities; hence, calibration techniques are considered to improve the interpretation of the model's scores. We find ensemble methods to deliver better results than the best single model. Specifically, the Random Forest delivers the best performance on the given dataset. When compared to the traditional credit scoring methods of the Frankfurter Sparkasse, the Random Forest shows significant improvement in predicting a borrower's default within a 12-month period. Logistic regression is used as a benchmark to validate the performance of the model.
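A hedged sketch of the two techniques this abstract highlights, sampling for class imbalance and calibrating ensemble scores into default probabilities, is given below. The confidential Frankfurter Sparkasse data is replaced by a synthetic stand-in, and the sampling ratio and model settings are assumptions for illustration.

```python
# Hedged sketch: undersampling to compensate class imbalance during
# training, then Platt (sigmoid) calibration so the random forest's scores
# can be read as default probabilities. Synthetic data stands in for the
# confidential credit portfolio.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Simple random undersampling of the majority (non-default) class.
rng = np.random.default_rng(0)
good, bad = np.where(y_tr == 0)[0], np.where(y_tr == 1)[0]
keep = np.concatenate([rng.choice(good, size=3 * len(bad), replace=False), bad])

# Calibration wraps the ensemble so its scores approximate probabilities.
rf = RandomForestClassifier(n_estimators=300, random_state=0)
model = CalibratedClassifierCV(rf, method="sigmoid", cv=5)
model.fit(X_tr[keep], y_tr[keep])
pd_scores = model.predict_proba(X_te)[:, 1]  # calibrated default probabilities
print("mean predicted PD:", pd_scores.mean())
```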
Styles APA, Harvard, Vancouver, ISO, etc.
23

Dias, Didier Narciso. « Soil Classification Resorting to Machine Learning Techniques ». Master's thesis, 2019. http://hdl.handle.net/10362/125335.

Texte intégral
Résumé :
Soil classification is the act of summarising the most relevant information about a soil profile into a single class, from which a large number of properties can be inferred without extensive knowledge of the subject. These classes make communication about soils, and about how they can best be used in areas such as agriculture and forestry, simpler and easier to understand. Unfortunately, soil classification is expensive and requires specialists to perform varied experiments in order to precisely attribute a class to a soil profile. This master's thesis focuses on machine learning algorithms for soil classification in the Mexico region, based mainly on the soils' intrinsic attributes. The dataset used contains 6 760 soil profiles, the 19 464 horizons that constitute them, and physical and chemical properties of those horizons, such as pH or organic content. Four data modelling methods were tested (standard depths, n first layers, thickness, and area-weighted thickness), as well as different values of k for a k-Nearest Neighbours imputation. A comparison between state-of-the-art machine learning algorithms was also made, namely Random Forests, Gradient Tree Boosting, Deep Neural Networks and Recurrent Neural Networks. All of the modelling methods provided very similar results when properly parametrised, reaching Kappa values of 0.504 and an accuracy of 0.554, with the standard depths method providing the most consistent results. The k parameter of the imputation had very little impact on the variation of the results. Gradient Tree Boosting was the algorithm with the best overall results, closely followed by the Random Forests model. The neuron-based methods never achieved a Kappa score over 0.4 and therefore provided substantially worse results.
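To make the pipeline in this abstract concrete, here is an illustrative Python sketch, not the thesis implementation: k-Nearest Neighbours imputation of missing attributes feeding a Gradient Tree Boosting classifier, evaluated with Cohen's Kappa. A synthetic multi-class dataset stands in for the Mexican soil data, and the missing-value rate and k value are assumptions.

```python
# Illustrative sketch: k-NN imputation of missing soil-horizon properties,
# followed by gradient tree boosting, scored with Cohen's Kappa on a
# held-out split. Synthetic multi-class data replaces the real soil data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan  # simulate gaps

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = make_pipeline(KNNImputer(n_neighbors=5),  # k is the studied parameter
                      GradientBoostingClassifier(random_state=0))
model.fit(X_tr, y_tr)
print("Kappa:", cohen_kappa_score(y_te, model.predict(X_te)))
```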
Styles APA, Harvard, Vancouver, ISO, etc.
