Dissertations / Theses on the topic 'Bagging Forest'

Consult the top 16 dissertations / theses for your research on the topic 'Bagging Forest.'

1

Rosales, Martínez Octavio. "Caracterización de especies en plasma frío mediante análisis de espectroscopia de emisión óptica por técnicas de Machine Learning." Tesis de maestría, Universidad Autónoma del Estado de México, 2020. http://hdl.handle.net/20.500.11799/109734.

Abstract:
Optical emission spectroscopy is a technique that identifies chemical elements from the electromagnetic spectrum emitted by a plasma. According to the literature, it has diverse applications, for example in identifying stellar objects and in determining the end point of plasma processes in semiconductor manufacturing; in this work specifically, spectra are processed to determine the elements present during the degradation of recalcitrant compounds. This document automatically identifies the spectra of elements such as He, Ar, N, O, and Hg, at energy levels one and two, using Machine Learning (ML) techniques. First, the element lines reported by NIST (National Institute of Standards and Technology) are downloaded, then preprocessed and unified for two subsequent processes: a) creating a generator of 84 synthetic spectra, implemented in Python with the ipywidgets module for Jupyter Notebook, which lets the user choose an element and energy level, vary the temperature and the full width at half maximum, and normalize the spectrum; and b) extracting the lines for the elements He, Ar, N, O, and Hg in the range from 200 nm to 890 nm, then oversampling them in order to run a hyperparameter search for the algorithms Decision Tree, Bagging, Random Forest, and Extremely Randomized Trees, based on the experimental-design principles of randomization, replication, blocking, and stratification.
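The synthetic-spectrum generator described in this abstract could be sketched roughly as follows. This is a minimal illustration, not the thesis code: the Gaussian line shape, the function name `synthetic_spectrum`, and the example line positions and intensities are all assumptions, and the temperature dependence mentioned in the abstract is not modeled. Only the full width at half maximum (FWHM) parameter, the normalization option, and the 200 nm to 890 nm range come from the abstract.

```python
import math

def synthetic_spectrum(lines, fwhm_nm, grid_nm, normalize=True):
    """Sum one Gaussian profile per emission line on a wavelength grid.

    lines: list of (center_nm, relative_intensity) pairs (hypothetical values).
    fwhm_nm: full width at half maximum shared by all lines.
    """
    # Convert FWHM to the Gaussian standard deviation.
    sigma = fwhm_nm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    spectrum = []
    for wl in grid_nm:
        total = 0.0
        for center, intensity in lines:
            total += intensity * math.exp(-0.5 * ((wl - center) / sigma) ** 2)
        spectrum.append(total)
    if normalize:
        # Scale so that the strongest point of the spectrum equals 1.
        peak = max(spectrum)
        if peak > 0:
            spectrum = [v / peak for v in spectrum]
    return spectrum

# Illustrative only: two made-up lines on a 200-890 nm grid with 0.5 nm steps.
grid = [200.0 + 0.5 * i for i in range(1381)]
demo = synthetic_spectrum([(500.0, 1.0), (700.0, 0.4)], fwhm_nm=2.0, grid_nm=grid)
```

With a 2 nm FWHM, the normalized profile falls to half its peak value 1 nm away from the line center, which is a quick sanity check on the width conversion.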
2

Булах, В. А., Л. О. Кіріченко, and Т. А. Радівілова. "Classification of Multifractal Time Series by Decision Tree Methods." Thesis, КНУ, 2018. http://openarchive.nure.ua/handle/document/5840.

Abstract:
The article considers the task of classifying model fractal time series using machine learning methods. To classify the series, meta-algorithms based on decision trees are proposed. Binomial stochastic cascade processes are used to model the fractal time series. The time series are classified by ensembles of decision tree models. The analysis indicates that the best results are obtained by the bagging and random forest methods that use regression trees.
3

Assareh, Amin. "OPTIMIZING DECISION TREE ENSEMBLES FOR GENE-GENE INTERACTION DETECTION." Kent State University / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=kent1353971575.

4

Yang, Kaolee. "A Statistical Analysis of Medical Data for Breast Cancer and Chronic Kidney Disease." Bowling Green State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1587052897029939.

5

Zoghi, Zeinab. "Ensemble Classifier Design and Performance Evaluation for Intrusion Detection Using UNSW-NB15 Dataset." University of Toledo / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1596756673292254.

6

Ulriksson, Marcus, and Shahin Armaki. "Analys av prestations- och prediktionsvariabler inom fotboll." Thesis, Uppsala universitet, Statistiska institutionen, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-324983.

Abstract:
This thesis attempts to explain how different variables describing the course of play in a football match affect the final result. These variables are divided into performance variables and quality variables. The performance variables are based on performance indicators inspired by Hughes and Bartlett (2002). The quality variables describe how good the teams are. To achieve this aim, various classification models are applied to both the performance variables and the quality variables. First, the most important performance indicators were examined. The best model classified about 60 % of matches correctly, and clearances and shots on target were the most important performance variables. Next, the best prediction variables were examined. The best model classified the correct final result for about 88 % of the matches. A prediction model with fewer variables was then built from what the authors considered the most important prediction variables. It classified about 86 % of the matches correctly. The prediction model was built from player ratings, odds of a draw, and referee.
7

Rosales, Elisa Renee. "Predicting Patient Satisfaction With Ensemble Methods." Digital WPI, 2015. https://digitalcommons.wpi.edu/etd-theses/595.

Abstract:
Health plans are constantly seeking ways to assess and improve the quality of patient experience in various ambulatory and institutional settings. Standardized surveys are a common tool used to gather data about patient experience, and a useful measurement taken from these surveys is known as the Net Promoter Score (NPS). This score represents the extent to which a patient would, or would not, recommend his or her physician on a scale from 0 to 10, where 0 corresponds to "Extremely unlikely" and 10 to "Extremely likely". A large national health plan utilized automated calls to distribute such a survey to its members and was interested in understanding what factors contributed to a patient's satisfaction. Additionally, they were interested in whether or not NPS could be predicted using responses from other questions on the survey, along with demographic data. When the distribution of various predictors was compared between the less satisfied and highly satisfied members, there was significant overlap, indicating that not even the Bayes Classifier could successfully differentiate between these members. Moreover, the highly imbalanced proportion of NPS responses resulted in initial poor prediction accuracy. Thus, due to the non-linear structure of the data, and high number of categorical predictors, we have leveraged flexible methods, such as decision trees, bagging, and random forests, for modeling and prediction. We further altered the prediction step in the random forest algorithm in order to account for the imbalanced structure of the data.
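The abstract does not say how the random forest prediction step was altered for the imbalanced data. One common option, shown here purely as an assumption and not as the author's method, is to lower the share of tree votes required before the rare class is predicted. The function name `predict_with_cutoff`, the labels, and the cutoff values are hypothetical.

```python
def predict_with_cutoff(tree_votes, rare_label, cutoff=0.5):
    """Return True when the rare class should be predicted.

    tree_votes: one predicted label per tree in the forest.
    cutoff: minimum share of votes for rare_label; 0.5 is close to plain
            majority voting in the two-class case, smaller values favor
            the rare class.
    """
    share = sum(v == rare_label for v in tree_votes) / len(tree_votes)
    return share >= cutoff

# Hypothetical forest of 10 trees: 3 vote "low", 7 vote "high".
votes = ["low"] * 3 + ["high"] * 7
default_call = predict_with_cutoff(votes, "low")                # majority rule
adjusted_call = predict_with_cutoff(votes, "low", cutoff=0.25)  # lowered cutoff
```

Lowering the cutoff trades more false alarms on the common class for fewer missed cases of the rare class, so the value would normally be chosen on held-out data.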
8

Alsouda, Yasser. "An IoT Solution for Urban Noise Identification in Smart Cities : Noise Measurement and Classification." Thesis, Linnéuniversitetet, Institutionen för fysik och elektroteknik (IFE), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-80858.

Abstract:
Noise is defined as any undesired sound. Urban noise and its effect on citizens are a significant environmental problem, and the increasing level of noise has become a critical problem in some cities. Fortunately, noise pollution can be mitigated by better planning of urban areas or controlled by administrative regulations. However, the execution of such actions requires well-established systems for noise monitoring. In this thesis, we present a solution for noise measurement and classification using a low-power and inexpensive IoT unit. To measure the noise level, we implement an algorithm for calculating the sound pressure level in dB. We achieve a measurement error of less than 1 dB. Our machine learning-based method for noise classification uses Mel-frequency cepstral coefficients for audio feature extraction and four supervised classification algorithms (that is, support vector machine, k-nearest neighbors, bootstrap aggregating, and random forest). We evaluate our approach experimentally with a dataset of about 3000 sound samples grouped in eight sound classes (such as car horn, jackhammer, or street music). We explore the parameter space of the four algorithms to estimate the optimal parameter values for the classification of sound samples in the dataset under study. We achieve noise classification accuracy in the range of 88% to 94%.
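The sound pressure level calculation mentioned in this abstract can be illustrated with the standard definition, SPL = 20 log10(p_rms / p_ref) with p_ref = 20 micropascals in air. The sketch below assumes the samples are already calibrated to pascals; the thesis's actual microphone calibration and any frequency weighting are not described in the abstract and are not modeled here. The function name is hypothetical.

```python
import math

P_REF = 20e-6  # reference pressure in pascals (20 micropascals, standard for air)

def sound_pressure_level_db(samples_pa):
    """Return the sound pressure level in dB for pressure samples in pascals."""
    if not samples_pa:
        raise ValueError("need at least one sample")
    # Root mean square of the pressure signal.
    rms = math.sqrt(sum(p * p for p in samples_pa) / len(samples_pa))
    if rms == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 20.0 * math.log10(rms / P_REF)

# A signal with an RMS pressure of 1 Pa corresponds to roughly 94 dB.
level = sound_pressure_level_db([1.0, -1.0, 1.0, -1.0])
```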
9

Thorén, Daniel. "Radar based tank level measurement using machine learning : Agricultural machines." Thesis, Linköpings universitet, Programvara och system, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176259.

Abstract:
Agriculture is becoming more dependent on computerized solutions to make the farmer's job easier. The big step that many companies are working towards is fully autonomous vehicles that work the fields. To that end, the equipment fitted to said vehicles must also adapt and become autonomous. Making this equipment autonomous takes many incremental steps, one of which is developing an accurate and reliable tank level measurement system. In this thesis, a system for tank level measurement in a seed planting machine is evaluated. Traditional systems use load cells to measure the weight of the tank; however, these types of systems are expensive to build and cumbersome to repair. They also add a lot of weight to the equipment, which increases the fuel consumption of the tractor. Thus, this thesis investigates the use of radar sensors together with a number of Machine Learning algorithms. Fourteen radar sensors are fitted to a tank at different positions, data is collected, and a preprocessing method is developed. Then, the data is used to test the following Machine Learning algorithms: Bagged Regression Trees (BG), Random Forest Regression (RF), Boosted Regression Trees (BRT), Linear Regression (LR), Linear Support Vector Machine (L-SVM), Multi-Layer Perceptron Regressor (MLPR). The model with the best 5-fold cross-validation scores was Random Forest, closely followed by Boosted Regression Trees. A robustness test, using 5 previously unseen scenarios, revealed that the Boosted Regression Trees model was the most robust. The radar position analysis showed that 6 sensors together with the MLPR model gave the best RMSE scores. In conclusion, the models performed well on this type of system, which shows that they might be a competitive alternative to load cell based systems.
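The 5-fold cross-validation with RMSE scoring mentioned in this abstract can be sketched in a few lines. For brevity the sketch fits a one-feature linear regression rather than any of the tree ensembles evaluated in the thesis, and the data, the fold assignment, and the function names are illustrative assumptions.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def kfold_rmse(xs, ys, k=5):
    """Return the per-fold RMSE of the fitted line under k-fold cross-validation."""
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th sample forms one fold
        train = [i for i in range(n) if i not in test_idx]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return scores

# Made-up data: a hypothetical radar reading against tank fill level.
readings = [float(i) for i in range(20)]
levels = [2.0 * r + 1.0 for r in readings]
fold_scores = kfold_rmse(readings, levels, k=5)
```

Because the made-up data are exactly linear, every fold's RMSE is essentially zero; with real sensor data the spread of the fold scores indicates how stable the model is.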
10

Feng, Wei. "Investigation of training data issues in ensemble classification based on margin concept : application to land cover mapping." Thesis, Bordeaux 3, 2017. http://www.theses.fr/2017BOR30016/document.

Abstract:
Classification has been widely studied in machine learning. Ensemble methods, which build a classification model by integrating multiple component learners, achieve higher performance than a single classifier. The classification accuracy of an ensemble is directly influenced by the quality of the training data used. However, real-world data often suffer from class noise and class imbalance. The ensemble margin is a key concept in ensemble learning. It has been applied to both the theoretical analysis and the design of machine learning algorithms. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. This work focuses on exploiting the margin concept to improve the quality of the training set, and thereby increase the classification accuracy of noise-sensitive classifiers, and to design effective ensemble classifiers that can handle imbalanced datasets. A novel ensemble margin definition is proposed. It is an unsupervised version of a popular ensemble margin: it does not involve the class labels. Mislabeled training data are a major challenge in building a robust classifier, whether it is an ensemble or not. To handle the mislabeling problem, we propose an ensemble margin-based class noise identification and elimination method based on an existing margin-based class noise ordering. This method can achieve a high mislabeled-instance detection rate while keeping the false detection rate as low as possible. It relies on the margin values of misclassified data, considering four different ensemble margins, including the novel proposed margin. The method is extended to class noise correction, which is a more challenging problem. Instances with low margins are more important than safe samples, which have high margins, for building a reliable classifier. A novel bagging algorithm based on a data importance evaluation function, relying again on the ensemble margin, is proposed to deal with the class imbalance problem. In our algorithm, the emphasis is placed on the lowest-margin samples. This method is evaluated, again using four different ensemble margins, on the imbalance problem, especially on multi-class imbalanced data. In remote sensing, where training data are typically ground-based, mislabeled training data are inevitable. Imbalanced training data are another problem frequently encountered in remote sensing. Both proposed ensemble methods, each using the margin definition best suited to the training data issue it addresses, are applied to land cover mapping.
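The abstract does not give the margin formulas, so the sketch below uses two definitions that are common in the ensemble literature and should be read as assumptions: a supervised margin (votes for the true class minus the most votes for any other class, divided by the number of base classifiers) and an unsupervised variant (votes for the top class minus votes for the runner-up, divided by the number of base classifiers). The thesis considers four margins in total; the function names here are hypothetical.

```python
from collections import Counter

def supervised_margin(votes, true_label):
    """(votes for the true class - most votes for any other class) / T."""
    counts = Counter(votes)
    v_true = counts.get(true_label, 0)
    v_other = max((c for lab, c in counts.items() if lab != true_label), default=0)
    return (v_true - v_other) / len(votes)

def unsupervised_margin(votes):
    """(votes for the top class - votes for the runner-up) / T; no label needed."""
    ranked = [c for _, c in Counter(votes).most_common(2)]
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0
    return (top - second) / len(votes)

# Ten hypothetical base classifiers voting on one land-cover sample.
votes = ["forest"] * 6 + ["crop"] * 3 + ["water"]
m_correct = supervised_margin(votes, "forest")   # positive: ensemble is right
m_wrong = supervised_margin(votes, "crop")       # negative: ensemble is wrong
m_unsup = unsupervised_margin(votes)             # confidence without a label
```

Under these definitions a margin near zero marks a sample on which the ensemble is undecided, which is the kind of low-margin instance the abstract treats as most informative.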
11

Булах, В. А., Л. О. Кириченко, and Т. А. Радивилова. "Сравнительный анализ классификации мультифрактальных временных рядов." Thesis, 2018. http://openarchive.nure.ua/handle/document/5777.

Abstract:
In this work, the Bagging and Random Forest methods were used to determine which class a time series belongs to. Each method used ensembles of both classification and regression decision trees. When regression decision trees are used, the model output is the probability that the multifractal cascade belongs to the given class. Depending on the length of the time series and the chosen method, the mean probability of predicting the class varies within the range (0.65, 0.93). The results showed that regression trees give substantially higher accuracy than classification trees.
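The idea of bagging regression trees on 0/1 class targets, so that the averaged output can be read as a class probability, can be sketched as follows. This is a toy illustration under stated assumptions: the base learner is a depth-1 regression tree (a stump) on a single made-up feature, whereas the paper's trees and time-series features are not specified in the abstract. All function names are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:  # the bootstrap sample had a single distinct x value
        return left_mean
    return left_mean if x <= threshold else right_mean

def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return ensemble

def predict_probability(ensemble, x):
    """Average the regression outputs; with 0/1 targets this lies in [0, 1]."""
    return sum(predict_stump(s, x) for s in ensemble) / len(ensemble)

# Toy data: one hypothetical feature, class 1 when the feature is large.
feature = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
label = [0, 0, 0, 0, 1, 1, 1, 1]
model = bagged_stumps(feature, label)
p_low = predict_probability(model, 0.1)
p_high = predict_probability(model, 0.9)
```

A random forest would additionally restrict each split to a random subset of the features; with a single feature, as here, that step has nothing to choose from, so only the bagging part is shown.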
12

Ganbayar, Otgonkhishig. "Predicting Credit Risk of Online Peer to Peer Lending by Applying Bagging and Random Forest Ensemble." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/2dr4te.

Abstract:
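Master's
National Taiwan University of Science and Technology
Department of Computer Science and Information Engineering
106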
In this research thesis, we aim to analyze the credit risk of online peer-to-peer (P2P) lending, a platform where individuals and businesses lend to or borrow from each other over the internet without a financial institution such as a bank. Even though the P2P system gives borrowers and investors some advantages compared with bank deposits, it faces the risk that loans are not repaid. The Lending Club platform's publicly available historical loan dataset for 2015-2017 is used in this research. The raw datasets are preprocessed with data-cleaning filters and resampled for training because the initial dataset is imbalanced. We propose Bagging and Random Forest ensemble machine learning algorithms to classify loan status as a good or bad loan, and an entropy-based feature selection method as a preprocessing technique to explore, analyze, and determine the factors that play a crucial role in predicting credit risk. The algorithms are optimized to distinguish potentially good loans while identifying defaults or bad loans. Other machine learning algorithms are also applied to compare the effectiveness of our proposed method. The experimental results show that our proposed method can effectively raise the prediction accuracy for default risk.
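The abstract names an entropy-based feature selection method without detailing it. A common entropy-based criterion is information gain, sketched below as an assumption rather than as the thesis's exact procedure. The column names and values are toy examples, not actual Lending Club fields, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Label entropy minus the weighted entropy after splitting on a categorical feature."""
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_features(columns, labels):
    """Return feature names sorted by decreasing information gain."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)

# Hypothetical toy data; column names are illustrative only.
loan_status = ["good", "good", "bad", "bad"]
columns = {
    "grade": ["A", "A", "C", "C"],     # perfectly separates the classes
    "term": ["36", "60", "36", "60"],  # carries no information here
}
ranking = rank_features(columns, loan_status)
```

Features at the top of the ranking would be kept for the ensemble models; numeric features would first need to be binned for this categorical version of the criterion.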
13

Lins, Stefan Martin. "Analyse und Vergleich des Modal Splits in den Jahren 2013 und 2018 auf Basis der SrV-Daten mithilfe von Random Forest." 2021. https://tud.qucosa.de/id/qucosa%3A74086.

Abstract:
The high share of transport in total emissions, the associated contribution to climate change, and the extensive land consumption of private transport reinforce the political demands for a transport transition. The aim of this thesis is to develop an optimal classification model using machine learning methods whose methodology is presented in detail. The model enables the evaluation and forecast of the choice of means of transport, and thus the modal split, on the basis of various influencing factors, particularly over the period between 2013 and 2018. Previous studies have focused on non-European areas and one-off surveys. The analysis draws on the mobility survey 'SrV - Mobilität in Städten', carried out by the Technische Universität Dresden, for the 25 large German comparison cities in 2013 and 2018. After data preparation, the individual feature variables are assessed for their suitability for modeling using descriptive methods and measures of association, in order to obtain the most meaningful model results possible. Based on CART decision trees, models with the Bagging, Random Forest, and Boosting algorithms are built for both years. To put the effectiveness of these models in context, Artificial Neural Network and Multinomial Logistic Regression models are also examined for both years. Random Forest achieved the best quality measures in the study, with an overall accuracy of 82.9 % (AUC 0.9458) for 2013 and 79.8 % (AUC 0.9377) for 2018; on this basis, the influencing factors are described and evaluated using a Variable Importance Plot and the Partial Dependence Plot. In particular, the length and duration of the trip and the availability of a season ticket for public transport are found to have the greatest influence on the choice of means of transport. Over time, it is noticeable that trips by private motorized transport in particular are being replaced by cycling and public transport trips, while walking shows only minor changes. Most of the estimated classification models achieve excellent predictions of the choice of means of transport, although predictions for the bicycle are the most difficult.

Table of contents:
List of figures; List of tables; List of abbreviations; List of symbols
1 Introduction
2 Literature review
3 Methodology
3.1 Decision trees: 3.1.1 Notation of the tree structure; 3.1.2 Regression trees; 3.1.3 Classification trees; 3.1.4 Pruning a tree and stopping criteria; 3.1.5 Assessment of the method
3.2 Bagging: 3.2.1 Idea; 3.2.2 Bootstrap; 3.2.3 Subsampling; 3.2.4 Principle of the bagging algorithm; 3.2.5 Assessment of the method and tuning
3.3 Random Forest: 3.3.1 Idea; 3.3.2 Principle of the random forest algorithm; 3.3.3 Assessment of the method and tuning; 3.3.4 Assessment of the influencing factors
3.4 Boosting: 3.4.1 Idea; 3.4.2 Principle of the AdaBoost method; 3.4.3 Evaluation
3.5 Artificial neural network: 3.5.1 Idea; 3.5.2 Principle of the artificial neural network; 3.5.3 Evaluation and tuning parameters
3.6 Multinomial logistic regression
3.7 Quality measures: 3.7.1 Accuracy; 3.7.2 ROC curve and AUC
4 Data
4.1 Data set
4.2 Data preparation: 4.2.1 Resolving the multilevel structure; 4.2.2 Household-level data; 4.2.3 Person-level data; 4.2.4 Trip-level data; 4.2.5 Outliers and missing values
5 Descriptive analysis
5.1 Evaluation of the categorical dependent variable
5.2 Evaluation of the cardinal variables: 5.2.1 Measures of dispersion and location; 5.2.2 Correlation between the cardinal variables
5.3 Evaluation of the ordinal and nominal variables: 5.3.1 Relative frequencies; 5.3.2 Assessment of the ordinal and nominal variables using Pearson's corrected contingency coefficient
5.4 Analysis of statistical differences between the two samples studied
6 Model results
6.1 Tree-based classification methods: 6.1.1 CART decision trees; 6.1.2 Bagging; 6.1.3 Random Forest; 6.1.4 Boosting
6.2 Artificial neural network
6.3 Multinomial logistic regression
7 Conclusion
8 Critical appraisal and outlook
Bibliography; Appendix; Acknowledgements
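The two quality measures reported in this abstract, overall accuracy and AUC, can be computed as in the sketch below. The thesis deals with a multi-class target (several means of transport), and the abstract does not say how its AUC was aggregated across classes; the binary, pairwise-comparison version shown here is therefore an assumption, as are the function names and toy values.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (the overall hit rate)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. y_true holds 1 for positive and 0 for negative.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 1 could stand for "trip made by bicycle", 0 for any other mode.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
auc = binary_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Accuracy depends on the chosen decision threshold, while AUC summarizes ranking quality over all thresholds, which is why the two measures are usually reported together.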
14

Λυπιτάκη, Αναστασία Δήμητρα Δανάη. "Μηχανική μάθηση σε ανομοιογενή δεδομένα." Thesis, 2014. http://hdl.handle.net/10889/8630.

Abstract:
Ideally, Machine Learning (ML) algorithms should generalize to every class with the same accuracy. That is, in a two-class problem with positive and negative cases, the algorithm should predict the positive and the negative examples equally accurately. In many applications, however, ML algorithms must learn from data sets that contain far more examples of one class than of the other. In general, inductive algorithms are designed to minimize errors. As a consequence, classes that contain few cases can be largely ignored, because the cost of misclassifying the over-represented class outweighs the cost of misclassifying the smaller class. The problem of imbalanced data sets occurs in many real applications, such as medical diagnosis, robotics, industrial production processes, communication network error detection, automated testing of electronic equipment, and many other areas. This dissertation, entitled 'Machine Learning with Imbalanced Data', addresses the efficient use of ML algorithms on imbalanced data sets. The thesis includes a general description of basic ML algorithms and of methods for dealing with imbalanced data sets. A number of algorithmic techniques for handling imbalanced data sets are presented, such as AdaCost, Cost Sensitive Boosting, MetaCost, and others. The evaluation metrics of ML methods for imbalanced data sets are presented, including ROC (Receiver Operating Characteristic) curves, PR (Precision and Recall) curves, and cost curves. A new hybrid ML algorithm combining the OverBagging and Rotation Forest algorithms is introduced, and it is compared with other related algorithms on a collection of imbalanced data sets using the WEKA environment. Experimental results demonstrate the superior performance of the proposed algorithm. Finally, the conclusions of this research work are presented and several future research directions are given.
15

Rodríguez, Hernán Cortés. "Ensemble classifiers in remote sensing: a comparative analysis." Master's thesis, 2014. http://hdl.handle.net/10362/11671.

Abstract:
Dissertation submitted in partial fulfillment of the requirements for the Degree of Master of Science in Geospatial Technologies.
Land Cover and Land Use (LCLU) maps are very important tools for understanding the relationships between human activities and the natural environment. Accurately defining all the features on the Earth's surface is essential to ensure they are managed properly. The basic data used to derive those maps are remote sensing imagery (RSI), specifically satellite images. Hence, new techniques and methods that can handle those data accurately are in demand. In this work, our goal was to briefly review some of the current approaches in the scientific community to this challenge of achieving higher accuracy in LCLU maps. We focus, however, on the study of classifier ensembles and the different strategies for such ensembles presented in the literature. We have proposed different ensemble strategies based on our data and previous work, in order to increase the accuracy of previous LCLU maps made using the same data and single classifiers. Finally, only one of the proposed ensembles achieved significantly higher accuracy in LCLU map classification than the best single classifier on the same data. It was also shown that diversity did not play an important role in the success of this ensemble.
16

Kandel, Ibrahem Hamdy Abdelhamid. "A comparative study of tree-based models for churn prediction : a case study in the telecommunication sector." Master's thesis, 2019. http://hdl.handle.net/10362/60302.

Abstract:
Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Marketing Research e CRM
In recent years, customer churn, the phenomenon of customers abandoning a company for another, has gained increasing importance. Customer churn plays an important role especially in more saturated industries such as telecommunications, since existing customers are very valuable and the cost of acquiring new customers is very high nowadays. Companies want to know which of their customers are going to churn to another provider, and when, so that measures can be taken to retain the customers who are at risk of churning. Such measures could take the form of incentives to the churners, but the downside is that misclassifying churners costs the company a lot, especially when incentives are given to non-churning customers. The common challenge in predicting customer churn is how to pre-process the data and which algorithm to choose, especially when the dataset is heterogeneous, which is very common for telecommunication companies’ datasets. This thesis aims to predict customer churn in the telecommunication sector using different decision tree algorithms and their ensemble models.