Theses on the topic "Random Decision Forests"

To see other types of publications on this topic, follow the link: Random Decision Forests.

Create an accurate reference in APA, MLA, Chicago, Harvard, and various other styles.

Consult the 50 best theses for your research on the topic "Random Decision Forests".

Next to every source in the list of references there is an "Add to bibliography" button. Click this button, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication as a PDF and read its abstract online, whenever this information is included in the metadata.

Browse theses on a wide variety of disciplines and organize your bibliography correctly.

1

Julock, Gregory Alan. « The Effectiveness of a Random Forests Model in Detecting Network-Based Buffer Overflow Attacks ». NSUWorks, 2013. http://nsuworks.nova.edu/gscis_etd/190.

Abstract:
Buffer overflows are a common type of network intrusion attack that continues to plague the networked community. Unfortunately, this type of attack is not well detected by current data mining algorithms. This research investigated the use of Random Forests, an ensemble technique that builds multiple decision trees and combines their votes into a final classification. The research investigated Random Forests' effectiveness in detecting buffer overflows compared to other data mining methods such as CART and Naïve Bayes. Random Forests was used for variable reduction, cost-sensitive classification was applied, and each method's detection performance was compared and reported along with the receiver operating characteristics. The experiment showed that Random Forests outperformed CART and Naïve Bayes in classification performance. Using a technique to identify the variables most important to buffer overflow detection, Random Forests was also able to further improve its classification performance.
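The ensemble-voting comparison described in this abstract can be sketched in a few lines. The following is a hypothetical scikit-learn example on synthetic data standing in for network-traffic features, not the thesis's actual experiment or dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for network-traffic data: rare attack class (~10%).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "CART": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]  # attack-class probability
    print(f"{name}: AUC = {roc_auc_score(y_te, scores):.3f}")
```

The area under the ROC curve summarizes the receiver operating characteristics that the thesis reports for each method.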
2

Rosales, Elisa Renee. « Predicting Patient Satisfaction With Ensemble Methods ». Digital WPI, 2015. https://digitalcommons.wpi.edu/etd-theses/595.

Abstract:
Health plans are constantly seeking ways to assess and improve the quality of patient experience in various ambulatory and institutional settings. Standardized surveys are a common tool used to gather data about patient experience, and a useful measurement taken from these surveys is known as the Net Promoter Score (NPS). This score represents the extent to which a patient would, or would not, recommend his or her physician on a scale from 0 to 10, where 0 corresponds to "Extremely unlikely" and 10 to "Extremely likely". A large national health plan utilized automated calls to distribute such a survey to its members and was interested in understanding what factors contributed to a patient's satisfaction. Additionally, they were interested in whether or not NPS could be predicted using responses from other questions on the survey, along with demographic data. When the distribution of various predictors was compared between the less satisfied and highly satisfied members, there was significant overlap, indicating that not even the Bayes Classifier could successfully differentiate between these members. Moreover, the highly imbalanced proportion of NPS responses resulted in initial poor prediction accuracy. Thus, due to the non-linear structure of the data, and high number of categorical predictors, we have leveraged flexible methods, such as decision trees, bagging, and random forests, for modeling and prediction. We further altered the prediction step in the random forest algorithm in order to account for the imbalanced structure of the data.
3

Varatharajah, Thujeepan, and Eriksson Victor. « A comparative study on artificial neural networks and random forests for stock market prediction ». Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186452.

Abstract:
This study investigates the predictive performance of two different machine learning (ML) models on the stock market and compares the results. The chosen models are based on artificial neural networks (ANN) and random forests (RF). The models are trained on two separate data sets and the predictions are made on the next-day closing price. The input vectors of the models consist of 6 different financial indicators based on the closing prices of the past 5, 10 and 20 days. The performance evaluation is done by analyzing and comparing values such as the root mean squared error (RMSE) and mean absolute percentage error (MAPE) over the test period. Specific behavior in subsets of the test period is also analyzed to evaluate the consistency of the models. The results showed that the ANN model performed better than the RF model, as it had lower errors relative to the actual prices throughout the test period and thus made more accurate predictions overall.
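As a rough illustration of this evaluation setup, lagged closing-price features and RMSE/MAPE scoring for both model families might look like the sketch below. The random-walk "price" series is synthetic, and simple rolling means stand in for the thesis's six financial indicators:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
price = 100 + np.cumsum(rng.normal(0, 1, 600))  # synthetic daily closing prices

# Stand-in indicators: rolling means of the past 5, 10 and 20 closes;
# target: the next-day closing price.
lags = [5, 10, 20]
X = np.column_stack(
    [[price[t - k:t].mean() for t in range(20, len(price) - 1)] for k in lags]
)
y = price[21:]

split = 400  # chronological train/test split
for name, model in [
    ("RF", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("ANN", MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)),
]:
    model.fit(X[:split], y[:split])
    pred = model.predict(X[split:])
    rmse = mean_squared_error(y[split:], pred) ** 0.5
    mape = mean_absolute_percentage_error(y[split:], pred)
    print(f"{name}: RMSE={rmse:.2f}  MAPE={mape:.4f}")
```

The chronological split matters here: shuffling a time series before splitting would leak future information into training.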
4

Pisetta, Vincent. « New Insights into Decision Trees Ensembles ». Thesis, Lyon 2, 2012. http://www.theses.fr/2012LYO20018/document.

Abstract:
Decision tree ensembles are among the most popular tools in machine learning. Nevertheless, their theoretical properties as well as their empirical performance remain under active investigation. In this thesis, we propose to shed light on these methods. More precisely, after describing the current theoretical aspects of three main ensemble schemes, Random Forests, Boosting, and Stochastic Discrimination (chapter 1), we give an analysis supporting the existence of common reasons for the success of these three principles (chapter 2). This analysis identifies the first two moments of the margin as an essential ingredient for obtaining strong learning abilities. Starting from this framework, we propose a new ensemble algorithm called OSS (Oriented Sub-Sampling) whose steps are in full accordance with the point of view we introduce. The empirical performance of OSS is superior to that of currently popular algorithms such as Random Forests and AdaBoost. In a third part (chapter 3), we analyze Random Forests from a "kernel" point of view. This perspective allows us to understand and observe the underlying regularization mechanism of these methods. Adopting the kernel point of view also enables us to improve the predictive performance of Random Forests using popular post-processing techniques such as SVMs and multiple kernel learning. In conjunction with Random Forests, these show greatly improved performance and make it possible to prune the ensemble, conserving only a small fraction of the initial base learners.
5

Funiok, Ondřej. « Využití statistických metod při oceňování nemovitostí ». Master's thesis, Vysoká škola ekonomická v Praze, 2017. http://www.nusl.cz/ntk/nusl-359241.

Abstract:
The thesis deals with the valuation of real estate in the Czech Republic using statistical methods. The work focuses on a complex task based on data from an advertising web portal. The aim of the thesis is to create a prototype statistical prediction model for the valuation of residential properties in Prague and to evaluate its potential for broader application. The structure of the work follows the CRISP-DM methodology. Regression trees and random forests are tested on the pre-processed data and used to predict real estate prices.
6

Jánoš, Andrej. « Vývoj kredit skóringových modelov s využitím vybraných štatistických metód v R ». Master's thesis, Vysoká škola ekonomická v Praze, 2016. http://www.nusl.cz/ntk/nusl-262242.

Abstract:
Credit scoring is an important and rapidly developing discipline. The aim of this thesis is to describe the basic methods used for building and interpreting credit scoring models, with an example application of these methods to designing such models using the statistical software R. The thesis is organized into five chapters. In chapter one, the term credit scoring is explained, with the main examples of its application and the motivation for studying this topic. The following chapters introduce the three methods most often used in financial practice for building credit scoring models. Chapter two discusses the most developed of these, logistic regression. The main emphasis is put on the logistic regression model, which is characterized from a mathematical point of view, and various ways to assess the quality of the model are presented. The other two methods presented in this thesis are decision trees and random forests, covered in chapters three and four. An important part of this thesis is a detailed application of the described models to a specific data set, Default, using the R program. The final fifth chapter is a practical demonstration of building credit scoring models, their diagnostics, and the subsequent evaluation of their applicability in practice using R. The appendices include the R code used, functions developed for testing the final model, and code used throughout the thesis. The key aim of the work is to provide enough theoretical knowledge and practical skills for a reader to fully understand the presented models and be able to apply them in practice.
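The thesis builds its models in R; a compact sketch of the same three-way comparison, here in Python with scikit-learn on an invented stand-in for the "Default" data set, could read:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Invented stand-in for a credit-default data set: ~10% defaulters.
X, y = make_classification(n_samples=1500, n_features=10, weights=[0.9, 0.1],
                           random_state=2)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("decision tree", DecisionTreeClassifier(max_depth=4, random_state=2)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=2)),
]:
    # AUC is a standard quality measure for scorecards, analogous to the
    # Gini coefficient used in credit-scoring practice (Gini = 2*AUC - 1).
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```

In R the analogous fits would typically use `glm`, `rpart`, and `randomForest`; the comparison logic is the same.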
7

Heckman, Derek J. « A Comparison of Classification Methods in Predicting the Presence of DNA Profiles in Sexual Assault Kits ». Bowling Green State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1513703948257233.

8

Hellsing, Edvin, and Joel Klingberg. « It’s a Match: Predicting Potential Buyers of Commercial Real Estate Using Machine Learning ». Thesis, Uppsala universitet, Institutionen för informatik och media, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-445229.

Abstract:
This thesis has explored the development and potential effects of an intelligent decision support system (IDSS) to predict potential buyers of commercial real estate property. The overarching need for an IDSS of this type stems from information overload, which the IDSS aims to reduce. By shortening the time needed to process data, time can be allocated to making sense of the environment with colleagues. The system architecture explored consisted of clustering commercial real estate buyers into groups based on their characteristics, and training a prediction model on historical transaction data for the Swedish market from the cadastral and land registration authority. The prediction model was trained to predict which of the cluster groups is most likely to buy a given property. For the clustering, three different clustering algorithms were used and evaluated: one density-based, one centroid-based, and one hierarchical. The best performing clustering model was the centroid-based one (K-means). For the predictions, three supervised machine learning algorithms were used and evaluated: Naive Bayes, Random Forests, and Support Vector Machines. The model based on Random Forests performed best, with an accuracy of 99.9%.
9

Федоров, Д. П. « Comparison of classifiers based on the decision tree ». Thesis, ХНУРЕ, 2021. https://openarchive.nure.ua/handle/document/16430.

Abstract:
The main purpose of this work is to compare classifiers. Random Forest and XGBoost are two popular machine learning algorithms. In this paper, we looked at how they work, compared their features, and evaluated the accuracy of their results.
10

Boshoff, Wiehan. « Use of Adaptive Mobile Applications to Improve Mindfulness ». Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1527174546252577.

11

Holloway, Jacinta. « Extending decision tree methods for the analysis of remotely sensed images ». Thesis, Queensland University of Technology, 2021. https://eprints.qut.edu.au/207763/1/Jacinta_Holloway_Thesis.pdf.

Abstract:
One UN Sustainable Development Goal focuses on monitoring the presence, growth, and loss of forests. The cost of tracking progress towards this goal is often prohibitive. Satellite images provide an opportunity to use free data for environmental monitoring. However, these images have missing data due to cloud cover, particularly in the tropics. In this thesis I introduce fast and accurate new statistical methods to fill these data gaps. I create spatial and stochastic extensions of decision tree machine learning methods for interpolating missing data. I illustrate these methods with case studies monitoring forest cover in Australia and South America.
12

Булах, В. А., Л. О. Кіріченко and Т. А. Радівілова. « Classification of Multifractal Time Series by Decision Tree Methods ». Thesis, КНУ, 2018. http://openarchive.nure.ua/handle/document/5840.

Abstract:
The article considers the task of classifying model fractal time series using machine learning methods. To classify the series, it is proposed to use meta-algorithms based on decision trees. Binomial stochastic cascade processes are used to model the fractal time series. Classification of the time series by ensembles of decision tree models is then carried out. The analysis indicates that the best results are obtained by bagging and random forests using regression trees.
13

Rico-Fontalvo, Florentino Antonio. « A Decision Support Model for Personalized Cancer Treatment ». Scholar Commons, 2014. https://scholarcommons.usf.edu/etd/5621.

Abstract:
This work is motivated by the need to provide patients with a decision support system that facilitates the selection of the most appropriate treatment strategy in cancer treatment. Treatment options are currently subject to predetermined clinical pathways and medical expertise, but generally do not consider the individual patient's characteristics or preferences. Although genomic patient data are available, this information is rarely used in the clinical setting for real-life patient care. In the area of personalized medicine, advancement in the fundamental understanding of cancer biology and clinical oncology can promote the prevention, detection, and treatment of cancer diseases. The objectives of this research are twofold: 1) to develop a patient-centered decision support model that can determine the most appropriate cancer treatment strategy based on subjective medical decision criteria and the patient's characteristics concerning the available treatment options and desired clinical outcomes; and 2) to develop a methodology to organize and analyze gene expression data and validate its accuracy as a predictive model for a patient's response to radiation therapy (tumor radiosensitivity). The complexity and dimensionality of the data generated from gene expression microarrays require advanced computational approaches. The microarray gene expression data processing and prediction model is built in four steps: response variable transformation to emphasize the lower and upper extremes (related to radiosensitive and radioresistant cell lines); dimensionality reduction to select candidate gene expression probesets; model development using a Random Forest algorithm; and validation of the model in two clinical cohorts of colorectal and esophagus cancer patients. Subjective human decision-making plays a significant role in defining the treatment strategy.
Thus, the decision model developed in this research uses language and mechanisms suitable for human interpretation and understanding through fuzzy sets and degrees of membership. The treatment selection strategy is modeled using a fuzzy logic framework to account for the subjectivity associated with the medical strategy and the patient's characteristics and preferences. The decision model considers criteria associated with survival rate, adverse events, and efficacy (measured by radiosensitivity) for treatment recommendation. Finally, a sensitivity analysis evaluates the impact of introducing radiosensitivity into the decision-making process. The intellectual merit of this research stems from the fact that it advances the science of decision-making by integrating concepts from the fields of artificial intelligence, medicine, biology, and biostatistics to develop a decision aid approach that considers conflicting objectives and has high practical value. The model focuses on criteria relevant to cancer treatment selection, but it can be modified and extended to other scenarios beyond the healthcare environment.
14

Santos, Daniel Filipe Pé-Leve dos. « Plataforma integrada de dados de acidentes de viação para suporte a processos de aprendizagem automática ». Master's thesis, Universidade de Évora, 2022. http://hdl.handle.net/10174/31064.

Abstract:
Integrated road accident data platform to support machine learning techniques. Traffic accidents are among the most important concerns worldwide, since they result in numerous casualties, injuries, and fatalities each year, as well as significant economic losses. Many factors are responsible for causing road accidents. If these factors can be better understood and predicted, it might be possible to take measures to mitigate the damage and its severity. The purpose of this dissertation is to identify these factors using accident data from 2016 to 2019 from the district of Setúbal, Portugal. This work aims at developing models that can select a set of influential factors that may be used to classify the severity of an accident, supporting an analysis of the accident data. In addition, this study also proposes a predictive model for future road accidents based on past data. Various machine learning approaches are used to create these models: supervised methods such as decision trees (DT), random forests (RF), logistic regression (LR) and naive Bayes (NB), as well as unsupervised techniques including DBSCAN and hierarchical clustering. Results show that a rule-based model using the C5.0 algorithm is capable of accurately detecting the most relevant factors describing road accident severity. Furthermore, the results of the predictive model suggest that the RF model could be a useful tool for forecasting accident hotspots.
15

Wright, Lindsey. « Classifying textual fast food restaurant reviews quantitatively using text mining and supervised machine learning algorithms ». Digital Commons @ East Tennessee State University, 2018. https://dc.etsu.edu/honors/451.

Abstract:
Companies continually seek to improve their business model through feedback and customer satisfaction surveys. Social media provides additional opportunities for this advanced exploration into the mind of the customer. By extracting customer feedback from social media platforms, companies may increase the sample size of their feedback and remove bias often found in questionnaires, resulting in better informed decision making. However, simply using personnel to analyze the thousands of relevant social media posts is financially expensive and time consuming. Thus, our study aims to establish a method to extract business intelligence from social media content by structuralizing opinionated textual data using text mining and classifying these reviews by the degree of customer satisfaction. By quantifying textual reviews, companies may perform statistical analysis to extract insight from the data as well as effectively address concerns. Specifically, we analyzed a subset of 56,000 Yelp reviews of fast food restaurants and attempted to predict a quantitative value reflecting the overall opinion of each review. We compare the use of two different predictive modeling techniques, bagged decision trees and random forest classifiers. In order to simplify the problem, we train our model to accurately classify strongly negative and strongly positive reviews (1 and 5 stars). In addition, we identify the drivers behind strongly positive or negative reviews, allowing businesses to understand their strengths and weaknesses. This method provides companies an efficient and cost-effective way to process and understand customer satisfaction as it is discussed on social media.
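A minimal sketch of the pipeline this abstract describes, text vectorization followed by bagged decision trees and a random forest, assuming scikit-learn and a tiny invented review corpus in place of the Yelp data:

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Tiny invented corpus: 1-star vs 5-star fast-food reviews.
reviews = [
    "terrible food and rude staff", "cold fries, never again",
    "worst burger I have ever had", "dirty tables and slow service",
    "amazing burger, great value", "friendly staff and quick service",
    "best fries in town, loved it", "delicious food, highly recommend",
]
stars = [1, 1, 1, 1, 5, 5, 5, 5]

vec = TfidfVectorizer()
X = vec.fit_transform(reviews)  # structuralize the text as TF-IDF features

bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=3).fit(X, stars)
forest = RandomForestClassifier(n_estimators=50, random_state=3).fit(X, stars)

query = vec.transform(["awful experience, cold food and rude staff"])
print("bagged trees:", bagged.predict(query)[0])
print("random forest:", forest.predict(query)[0])
```

The two ensembles differ only in that the random forest additionally subsamples features at each split; on real review data both would be trained and compared on a held-out test set.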
16

Karlsson, Daniel, and Alex Lindström. « Automated Learning and Decision: Making of a Smart Home System ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-234313.

Abstract:
Smart homes are custom-fitted systems for users to manage their home environments. A smart home consists of devices that can communicate with each other. In a smart home system, this communication is used by a central control unit to manage the environment and the devices in it. Setting up a smart home today involves a lot of manual customization to make it function as the user wishes. What smart homes lack is the ability to learn from users' behaviour and habits in order to provide a customized environment autonomously. The purpose of this thesis is to examine whether environmental data can be collected and used in a small smart home system to learn about user behaviour. To collect data and attempt this learning process, a system was set up. The system uses a central control unit to mediate between wireless electrical outlets and sensors. The sensors track motion, light, temperature and humidity. The device and sensor readings, along with user interactions in the environment, make up the collected data. By studying the collected data, the system is able to create rules, which it uses to make decisions within its environment to suit the user's needs. The performance of the system varies depending on how the data collection is handled. The results show that it is important to collect data both at intervals and whenever the user takes an action.
17

Revend, War. « Predicting House Prices on the Countryside using Boosted Decision Trees ». Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279849.

Abstract:
This thesis evaluates the feasibility of supervised learning models for predicting house prices in the countryside of southern Sweden. It is essential for mortgage lenders to have accurate housing valuation algorithms, and the current model offered by Booli is not accurate enough when evaluating residence prices in the countryside. Different types of boosted decision trees were implemented to address this issue, and their performance was compared to traditional machine learning methods. These supervised learning models were implemented in order to find the best model with regard to relevant evaluation metrics such as root mean squared error (RMSE) and mean absolute percentage error (MAPE). The implemented models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. All of these models were benchmarked against Booli's current housing valuation algorithm, which is based on a k-NN model. The results indicate that the LightGBM model is the optimal one, as it had the best overall performance with respect to the chosen evaluation metrics. Compared to the benchmark, the LightGBM model performed better overall, with an RMSE of 0.330 against 0.358 for the Booli model, indicating that boosted decision trees have the potential to improve the predictive accuracy of residence prices in the countryside.
18

Assareh, Amin. « OPTIMIZING DECISION TREE ENSEMBLES FOR GENE-GENE INTERACTION DETECTION ». Kent State University / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=kent1353971575.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
19

Ciss, Saïp. « Forêts uniformément aléatoires et détection des irrégularités aux cotisations sociales ». Thesis, Paris 10, 2014. http://www.theses.fr/2014PA100063/document.

Texte intégral
Résumé :
In this thesis we present an application of statistical learning to the detection of irregularities in social security contributions. Statistical learning aims to model problems in which there is a relationship, generally non-deterministic, between variables and the phenomenon one seeks to evaluate. An essential aspect of this modelling is the prediction of unknown occurrences of the phenomenon from data already observed. In the case of social security contributions, the representation of the problem rests on the postulate of a relationship between the contribution declarations of companies and the audits carried out by the collection agencies. Audit inspectors certify the accurate or inaccurate nature of a number of declarations and, where appropriate, notify the companies concerned of an adjustment. The learning algorithm "learns", through a model, the relationship between the declarations and the audit outcomes, then produces an evaluation of all declarations not yet audited. The first part of the evaluation assigns a regular or irregular character to each declaration, with a certain probability. The second estimates the expected adjustment amounts for each declaration. Within the URSSAF (Union de Recouvrement des cotisations de Sécurité sociale et d'Allocations Familiales) of Île-de-France, and under a CIFRE contract (Conventions Industrielles de Formation par la Recherche), we developed a model for detecting irregularities in social security contributions, which we present and detail throughout the thesis. The algorithm runs under the free software R. It is fully operational and was tested in a real-world setting during 2012. To guarantee its properties and results, probabilistic and statistical tools are necessary, and we discuss the theoretical aspects that accompanied its design.
In the first part of the thesis, we give a general presentation of the problem of detecting irregularities in social security contributions. In the second, we address detection specifically, through the data used to define and evaluate irregularities; in particular, the only data available are sufficient to model detection. We also present a new random forest algorithm, named "uniformly random forest", which constitutes the detection engine. In the third part, we detail the theoretical properties of uniformly random forests. In the fourth, we present an economic point of view for cases in which irregularities in social security contributions are deliberate, in the context of the fight against undeclared work; in particular, we examine the link between the financial situation of companies and social security contribution fraud. The last part is devoted to the experimental and real-world results of the model, which we discuss. Each chapter of the thesis can be read independently of the others, and a few notions are repeated in order to make the content easier to explore.
We present in this thesis an application of machine learning to irregularities in social security contributions. These are, in France, all the contributions due by employees and companies to the "Sécurité sociale", the French system of social welfare (replacement income in case of unemployment, health insurance, pensions, ...). Social contributions are paid by companies to the URSSAF network, which is in charge of recovering them. Our main goal was to build a model able to detect irregularities with a low false positive rate. We begin the thesis by presenting the URSSAF, how irregularities can appear, how we can handle them, and what data we can use. We then describe a new machine learning algorithm we have developed for this purpose, "random uniform forests" (and its R package "randomUniformForest"), a variant of Breiman's Random Forests (tm): they share the same principles but apply them in a different way. We present the theoretical background of the model and provide several examples. We then use it to show, when irregularities amount to fraud, how the financial situation of firms can affect their propensity to commit fraud. In the last chapter, we provide a full evaluation of the declarations of social contributions of all firms in Île-de-France for the year 2013, using the model to predict whether or not declarations present irregularities.
Styles APA, Harvard, Vancouver, ISO, etc.
20

Yan, Ping. « Anomaly Detection in Categorical Data with Interpretable Machine Learning : A random forest approach to classify imbalanced data ». Thesis, Linköpings universitet, Statistik och maskininlärning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-158185.

Texte intégral
Résumé :
Metadata refers to "data about data", which contains information needed to understand the process of data collection. In this thesis, we investigate whether metadata features can be used to detect broken data and how a tree-based interpretable machine learning algorithm can be used for effective classification. The goal of this thesis is two-fold. Firstly, we apply a classification schema using metadata features for detecting broken data. Secondly, we generate feature importance rates to understand the model's logic and reveal the key factors that lead to broken data. The given task from the Swedish automotive company Veoneer is a typical problem of learning from an extremely imbalanced data set, with 97 percent of the data belonging to healthy data and only 3 percent to broken data. Furthermore, the whole data set contains only categorical variables on nominal scales, which brings challenges for the learning algorithm. The handling of imbalanced problems for continuous data is relatively well studied, but for categorical data the solution is not straightforward. In this thesis, we propose a combination of tree-based supervised learning and hyper-parameter tuning to identify the broken data in a large data set. Our method is composed of three phases: data cleaning, which eliminates ambiguous and redundant instances; supervised learning with a random forest; and, lastly, a random search for hyper-parameter optimization of the random forest model. Our results show empirically that the tree-based ensemble method, together with a random search for hyper-parameter optimization, improves random forest performance in terms of the area under the ROC curve. The model exceeded an acceptable classification result and showed that metadata features are capable of detecting broken data and providing an interpretable result by identifying the key features for the classification model.
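The combination described above (a random forest tuned by random search and scored on the area under the ROC curve, on heavily imbalanced data) can be sketched as follows. The dataset, the search grid and the class weighting are illustrative assumptions, not the thesis's actual configuration:

```python
# Hypothetical sketch: random forest + random search for hyper-parameter
# optimization, scored by ROC AUC on an imbalanced dataset (~3% positives,
# as in the setting described above). Data and grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.97],
                           random_state=0)

param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions, n_iter=5, scoring="roc_auc", cv=3, random_state=0)
search.fit(X, y)
print("best cross-validated AUC:", round(search.best_score_, 3))
print("best parameters:", search.best_params_)
```

After fitting, `search.best_estimator_.feature_importances_` would give the feature importance rates used for interpretation.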
Styles APA, Harvard, Vancouver, ISO, etc.
21

Lundström, Love, et Oscar Öhman. « Machine Learning in credit risk : Evaluation of supervised machine learning models predicting credit risk in the financial sector ». Thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-164101.

Texte intégral
Résumé :
When banks lend money to another party they face a risk that the borrower will not fulfill its obligation towards the bank. This risk is called credit risk and it is the largest risk a bank faces. According to the Basel accords, banks need to hold a certain amount of capital to protect themselves against future financial crises. This amount is calculated for each loan with an attached risk-weighted asset, RWA. The main parameters in RWA are the probability of default and the loss given default. Banks are today allowed to use their own internal models to calculate these parameters. Since holding capital that earns no interest is a great cost, banks seek tools to better predict the probability of default and thereby lower the capital requirement. Machine learning and supervised algorithms such as logistic regression, neural networks, decision trees and random forests can be used to assess credit risk. By training algorithms on historical data with known outcomes, the parameter probability of default (PD) can be determined with a higher degree of certainty than with traditional models, leading to a lower capital requirement. On the data set used in this thesis, logistic regression seems to be the algorithm with the highest accuracy in classifying customers into the right category. However, it classifies many people as false positives, meaning the model predicts that a customer will honour its obligation when in fact the customer defaults. This comes at a great cost for the banks. By implementing a cost function to minimize this error, we found that the neural network has the lowest false positive rate and is therefore the model best suited for this specific classification task.
When banks lend money to another party, a risk arises that the borrower will not fulfill its commitment to the bank. This risk is called credit risk and is the largest risk a bank faces. According to the Basel regulations, a bank must set aside a certain amount of capital for every loan it issues in order to protect itself against future financial crises. This amount is calculated for each individual loan with an associated risk weight, RWA. The main parameters in RWA are the probability that a customer cannot repay the loan and the amount the bank then loses. Today, banks may use internal models to estimate these parameters. Since tied-up capital entails large costs for banks, they strive to find better tools for estimating the probability that a customer defaults, in order to reduce their capital requirement. Banks have therefore begun to look at the possibility of using machine learning algorithms to estimate these parameters. Machine learning algorithms such as logistic regression, neural networks, decision trees and random forests can be used to determine credit risk. By training algorithms on historical data with known outcomes, the parameter describing the probability that a customer does not repay the loan (PD) can be determined with greater certainty than with traditional methods. On the data this thesis is based on, logistic regression turns out to be the algorithm with the highest accuracy in classifying a customer into the right category. However, this algorithm classifies many customers as false positives, meaning it predicts that many customers will repay their loans when in fact they do not. This entails a large cost for the banks.
By instead evaluating the models with a cost function introduced to reduce this error, we find that the neural network has the lowest false positive rate and is therefore the model best suited to this specific classification task.
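Evaluating classifiers with an asymmetric cost function, as described above, amounts to penalising false positives more heavily than false negatives when choosing a decision threshold. The following sketch illustrates the idea; the cost values, the data and the model are assumptions, not the thesis's actual setup:

```python
# Hypothetical sketch: picking the decision threshold that minimises an
# asymmetric misclassification cost, where a false positive (predicting
# repayment for a customer who defaults) is costlier than a false negative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

COST_FP, COST_FN = 5.0, 1.0  # assumed relative costs, illustrative only

def expected_cost(threshold):
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.05, 0.95, 19)
best_t = min(thresholds, key=expected_cost)
print("cost-minimising threshold:", round(float(best_t), 2))
```

Under such a cost function, a model with slightly lower accuracy but a lower false positive rate, like the neural network above, can be the preferred choice.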
Styles APA, Harvard, Vancouver, ISO, etc.
22

Rosales, Martínez Octavio. « Caracterización de especies en plasma frío mediante análisis de espectroscopia de emisión óptica por técnicas de Machine Learning ». Tesis de maestría, Universidad Autónoma del Estado de México, 2020. http://hdl.handle.net/20.500.11799/109734.

Texte intégral
Résumé :
Optical emission spectroscopy is a technique that allows the identification of chemical elements using the electromagnetic spectrum emitted by a plasma. According to the literature, it has diverse applications, for example: in the identification of stellar objects, in determining the endpoint of plasma processes in semiconductor manufacturing or, specifically in this work, in processing spectra to determine the elements present in the degradation of recalcitrant compounds. In this document, spectra of elements such as He, Ar, N, O and Hg, at their energy levels one and two, are identified automatically by means of Machine Learning (ML) techniques. First, the element lines reported by NIST (National Institute of Standards and Technology) are downloaded, then preprocessed and unified for the following processes: a) building a generator of 84 synthetic spectra, implemented in Python and the ipywidgets module of Jupyter Notebook, with options to choose an element and energy level, vary the temperature and the full width at half maximum, and normalize the spectrum; and b) extracting the lines of the elements He, Ar, N, O and Hg in the range from 200 nm to 890 nm, after which oversampling is applied and a hyperparameter search is carried out for the algorithms Decision Tree, Bagging, Random Forest and Extremely Randomized Trees, following the design-of-experiments principles of randomization, replication, blocking and stratification.
Styles APA, Harvard, Vancouver, ISO, etc.
23

Doubleday, Kevin. « Generation of Individualized Treatment Decision Tree Algorithm with Application to Randomized Control Trials and Electronic Medical Record Data ». Thesis, The University of Arizona, 2016. http://hdl.handle.net/10150/613559.

Texte intégral
Résumé :
With new treatments and novel technology available, personalized medicine has become a key topic in the new era of healthcare. Traditional statistical methods for personalized medicine and subgroup identification primarily focus on single-treatment or two-arm randomized control trials (RCTs). With restricted inclusion and exclusion criteria, data from RCTs may not reflect real-world treatment effectiveness. However, electronic medical records (EMR) offer an alternative venue. In this paper, we propose a general framework to identify an individualized treatment rule (ITR), which connects subgroup identification methods and ITR. It is applicable to both RCT and EMR data. Given the large scale of EMR datasets, we develop a recursive partitioning algorithm to solve the problem (ITR-Tree). A variable importance measure for personalized medicine is also developed using random forest. We demonstrate our method through simulations and apply ITR-Tree to datasets from diabetes studies using both RCT and EMR data. A software package is available at https://github.com/jinjinzhou/ITR.Tree.
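A random-forest variable importance measure, in the general spirit of the one mentioned above (the thesis develops its own tailored measure, not reproduced here), can be sketched as follows on synthetic data:

```python
# Hypothetical sketch: ranking features by random-forest impurity-based
# importance. The data and feature indices are synthetic stand-ins, not
# the thesis's ITR-specific importance measure.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=5)
rf = RandomForestClassifier(n_estimators=200, random_state=5).fit(X, y)

ranking = sorted(enumerate(rf.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
for idx, imp in ranking:
    print(f"feature {idx}: importance {imp:.3f}")
```

Impurity-based importances sum to one across features, so they can be read as relative shares of the forest's split quality.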
Styles APA, Harvard, Vancouver, ISO, etc.
24

Dinger, Steven. « Essays on Reinforcement Learning with Decision Trees and Accelerated Boosting of Partially Linear Additive Models ». University of Cincinnati / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1562923541849035.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
25

Bitara, Matúš. « Srovnání heuristických a konvenčních statistických metod v data miningu ». Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2019. http://www.nusl.cz/ntk/nusl-400833.

Texte intégral
Résumé :
The thesis deals with the comparison of conventional and heuristic methods in data mining used for binary classification. In the theoretical part, four different models are described, and model classification is demonstrated on simple examples. In the practical part, the models are compared on real data. This part also covers data cleaning, outlier removal, two different transformations and dimension reduction. The last part describes the methods used to assess the quality of the models.
Styles APA, Harvard, Vancouver, ISO, etc.
26

Mistry, Pritesh. « A Knowledge Based Approach of Toxicity Prediction for Drug Formulation. Modelling Drug Vehicle Relationships Using Soft Computing Techniques ». Thesis, University of Bradford, 2015. http://hdl.handle.net/10454/14440.

Texte intégral
Résumé :
This multidisciplinary thesis is concerned with the prediction of drug formulations for the reduction of drug toxicity. Both scientific and computational approaches are utilised to make original contributions to the field of predictive toxicology. The first part of this thesis provides a detailed scientific discussion of all aspects of drug formulation and toxicity. Discussions are focused on the principal mechanisms of drug toxicity and how drug toxicity is studied and reported in the literature. Furthermore, a review of the current technologies available for formulating drugs for toxicity reduction is provided, together with examples of studies reported in the literature that have used these technologies to reduce drug toxicity. The thesis also provides an overview of the computational approaches currently employed in the field of in silico predictive toxicology. This overview focuses on the machine learning approaches used to build predictive QSAR classification models, with examples from the literature. Two methodologies have been developed as part of the main work of this thesis. The first is focused on the use of directed bipartite graphs and Venn diagrams for the visualisation and extraction, from large un-curated datasets, of drug-vehicle relationships which show changes in the patterns of toxicity. These relationships can be rapidly extracted and visualised using the methodology proposed in chapter 4. The second methodology involves mining large datasets for the extraction of drug-vehicle toxicity data. It uses an area-under-the-curve principle to make pairwise comparisons of vehicles, which are classified according to the toxicity protection they offer, and from these comparisons predictive classification models based on random forests and decision trees are built. The results of this methodology are reported in chapter 6.
Styles APA, Harvard, Vancouver, ISO, etc.
27

Gomes, Alexandre Miguel Gonçalves. « Aplicação de machine learning no combate ao branqueamento de capitais e ao financiamento do terrorismo ». Master's thesis, Instituto Superior de Economia e Gestão, 2019. http://hdl.handle.net/10400.5/19977.

Texte intégral
Résumé :
Master's in Quantitative Methods for Economic and Business Decision-Making
This work is the result of an internship at Quidgest, S.A. This final master's project concerns an application of Machine Learning to the problem of combating money laundering and the financing of terrorism. This problem is known as a case of imbalanced data. The issue is therefore addressed over the course of the work, and several ways of resolving it are presented. The concepts of Machine Learning, Data Mining and Knowledge Discovery in Databases are also covered. Within Machine Learning, this work deals only with supervised algorithms, specifically the classifiers Random Forest, AdaBoost and Boosting C5.0. These methods were applied to a data repository hosted in the Microsoft SQL Server database management system. The investigation followed the CRISP-DM methodology and was implemented in the R software.
This work results from an internship developed at Quidgest, S.A. This final master's project deals with an application of Machine Learning to the problem of money laundering and the financing of terrorism. This problem is known as a case of imbalanced data. The issue is therefore addressed in the course of the work, and several ways of resolving it are presented. The concepts of Machine Learning, Data Mining and Knowledge Discovery in Databases are also discussed. Within Machine Learning, this work focuses only on supervised algorithms, more specifically the classifiers Random Forest, AdaBoost and Boosting C5.0. These methods were applied to a data repository hosted in the Microsoft SQL Server database management system. The research followed the CRISP-DM methodology and was implemented in the R software.
Styles APA, Harvard, Vancouver, ISO, etc.
28

Stříteský, Radek. « Sémantické rozpoznávání komentářů na webu ». Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2017. http://www.nusl.cz/ntk/nusl-317212.

Texte intégral
Résumé :
The main goal of this paper is the identification of comments on websites. The theoretical part focuses on artificial intelligence; mainly classifiers are described. The practical part deals with the creation of a training database, which is built using feature generators. A generated feature might be, for example, the title of the HTML element containing the comment. The training database serves as input for the classifiers. The result of this paper is the testing of the classifiers in the RapidMiner program.
Styles APA, Harvard, Vancouver, ISO, etc.
29

Anchelía, Carhuaricra Danny Raúl, et Sáenz Ximena Nicole Mori. « Determinación de zonas susceptibles a inundaciones y análisis comparativo del Proceso de Análisis Jerárquico (AHP) y Random Forest (RF). Caso estudio : cuenca baja del río Chancay Lambayeque ». Bachelor's thesis, Universidad Nacional Mayor de San Marcos, 2020. https://hdl.handle.net/20.500.12672/15868.

Texte intégral
Résumé :
Floods are one of the main natural phenomena occurring in Peru, especially in the basins located in the northwest of the country; they are triggered by extreme rainfall and cause human and economic losses. For this reason, the development of models to identify flood-susceptible areas is essential for decision-makers. Accordingly, this research aims to carry out a comparative analysis between the model generated by the Analytic Hierarchy Process (AHP) and Random Forest, in order to establish the most suitable method for determining flood-susceptible areas in the lower basin of the Chancay-Lambayeque river. Six factors were considered, among the conditioning factors and the triggering factor: geology, soils, current land use, distance to the river, slope and precipitation. These factors were configured as raster datasets over the study area with a spatial resolution of 30 m x 30 m for use in both methods. A flood inventory was also used, generated from historical data on flood events obtained from government institutions, field work and the interpretation of Sentinel-2 satellite images recorded in 2017; 70% of the inventory was used as the training set for the Random Forest model, while the remaining 30% was used for the validation of both models. Susceptibility maps were then obtained with both models. The predictive power of each was tested by means of the area under the ROC curve; the results showed that the Random Forest method was more efficient for determining flood susceptibility, with a prediction rate of 0.9941, compared with a value of 0.9774 for the AHP method.
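The model comparison described above rests on computing the area under the ROC curve for each method on a held-out 30% validation set. A minimal sketch of that comparison follows; the data, the 70/30 split and the use of logistic regression as a stand-in for the AHP-derived score are all assumptions:

```python
# Hypothetical sketch: comparing two susceptibility models by ROC AUC on a
# 30% hold-out set, as in the validation scheme above. Data is synthetic;
# logistic regression stands in for the AHP score, which is not ML-based.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

rf = RandomForestClassifier(random_state=2).fit(X_tr, y_tr)
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

auc_rf = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
auc_base = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])
print(f"Random Forest AUC: {auc_rf:.4f}  baseline AUC: {auc_base:.4f}")
```

An AUC closer to 1 indicates better discrimination between flooded and non-flooded locations, which is how the 0.9941 vs. 0.9774 comparison above should be read.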
Styles APA, Harvard, Vancouver, ISO, etc.
30

Velka, Elina. « Loss Given Default Estimation with Machine Learning Ensemble Methods ». Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279846.

Texte intégral
Résumé :
This thesis evaluates the performance of three machine learning methods in the prediction of Loss Given Default (LGD). LGD can be seen as the opposite of the recovery rate, i.e. the ratio of an outstanding loan that the loan issuer would not be able to recover if the customer were to default. The methods investigated are decision trees, random forest and boosted methods. All of the methods investigated performed well in predicting the cases where the loan is not recovered at all, LGD = 1 (100%), or where the loan is totally recovered, LGD = 0 (0%). When the performance of the models was evaluated on a dataset where the observations with LGD = 1 were removed, a significant decrease in performance was observed. The random forest model built on an unbalanced training dataset showed better performance on the test dataset that included values of LGD = 1, while the random forest model built on a balanced training dataset performed better on the test set where the observations with LGD = 1 were removed. The boosted models evaluated in this study showed less accurate predictions than the other methods used. Overall, the random forest models showed slightly better results than the decision tree models, although the computational time (the cost) was considerably longer when running the random forest models. Decision tree models would therefore be suggested for the prediction of Loss Given Default.
This thesis investigates and compares three machine learning methods that estimate Loss Given Default (LGD). LGD can be seen as the opposite of the recovery rate, i.e. the share of the outstanding loan that the lender would not recover if the customer were to default. The machine learning methods examined in this work are decision trees, random forests and boosted methods. All methods performed well when estimating loans that are either not repaid at all, i.e. LGD = 1 (100%), or repaid in full, LGD = 0 (0%). A clear decrease in model accuracy was observed when the models were run on a dataset with the LGD = 1 observations removed. Random forest models built on an unbalanced training dataset performed better than the other models on test sets that included observations with LGD = 1. When observations with LGD = 1 were removed, random forest models built on a balanced training dataset turned out to perform better than the other models. Boosted models showed the weakest accuracy of the three methods examined in this study. Overall, the study showed that random forest models built on an unbalanced training dataset performed slightly better than decision tree models, but the computation time (the cost) was considerably longer when running the random forest models. Decision tree models would therefore be preferred for estimating loss given default.
Styles APA, Harvard, Vancouver, ISO, etc.
31

Konečný, Antonín. « Využití umělé inteligence v technické diagnostice ». Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2021. http://www.nusl.cz/ntk/nusl-443221.

Texte intégral
Résumé :
The diploma thesis is focused on the use of artificial intelligence methods for evaluating the fault condition of machinery. The evaluated data come from a vibrodiagnostic model for the simulation of static and dynamic unbalances. Machine learning methods are applied, specifically supervised learning. The thesis describes the Spyder software environment and its alternatives, and the Python programming language, in which the scripts are written. It contains an overview and description of the libraries (Scikit-learn, SciPy, Pandas, ...) and methods: K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Trees (DT) and Random Forest classifiers (RF). The results of the classification are visualized in a confusion matrix for each method. The appendix includes the scripts written for feature engineering, hyperparameter tuning, evaluation of learning success, and classification with visualization of the result.
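The workflow summarised above (training KNN, SVM, decision tree and random forest classifiers with Scikit-learn and summarising each with a confusion matrix) can be sketched as follows. The data here is synthetic, not vibrodiagnostic measurements:

```python
# Hypothetical sketch: training the four classifier families named above
# on the same synthetic data and printing a confusion matrix for each.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, n_classes=3,
                           n_informative=5, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=3),
    "RF": RandomForestClassifier(random_state=3),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = accuracy_score(y_te, pred)
    print(name, "accuracy:", round(results[name], 3))
    print(confusion_matrix(y_te, pred))
```

The diagonal of each confusion matrix counts correct classifications per fault class; off-diagonal entries show which fault conditions the model confuses.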
Styles APA, Harvard, Vancouver, ISO, etc.
32

Malmberg, Olle, et Bobby Zhou. « Using Machine Learning to Detect Customer Acquisition Opportunities and Evaluating the Required Organizational Prerequisites ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-263056.

Texte intégral
Résumé :
This paper aims to investigate whether it is possible to use machine learning to identify users who are about to change their service provider. The Consumer Decision Journey is believed to be a better model than traditional funnel models for depicting the process consumers go through leading up to a purchase. Analytical and operational Customer Relationship Management are presented as fields where such implementations can be useful. Based on previous studies, Random Forest and XGBoost were chosen as the algorithms to be evaluated further because of their generally high performance. The final results were produced by an iterative process that began with data processing, followed by feature selection, training of the model and testing of the model. A literature review and unstructured and semi-structured interviews with the employer Growth Hackers Sthlm were also used as complementary methods, with the purpose of gaining a wider perspective on the state of the art of ML implementations. The final results showed that Random Forest could identify the sought-after (positive) users, while XGBoost was inferior to Random Forest at distinguishing between positive and negative classes. An implementation of such a model could support and benefit an organization's customer acquisition operations. However, the organizational prerequisites regarding data infrastructure and the level of AI and machine learning integration in the organization's culture are the most important ones and need to be considered before such implementations.
This work investigates whether machine learning can be used to identify a behaviour among users indicating that the user is about to change service provider. The goal is to contribute to a machine learning tool for customer acquisition purposes, such as analytical and operational Customer Relationship Management. The behaviour sought in this report is based on the Consumer Decision Journey model, which describes four phases; in phase two the consumer actively searches for, and is more receptive to, information about the purchase. Based on previous studies and guidance from the client, the algorithms Random Forest and XGBoost were chosen as the main algorithms to be tested. The results were produced through an iterative process. The first step was to clean the data; parameters were then selected and weighted, after which the algorithms were tested against test data and evaluated. This was repeated in loops until improvements were only marginal. The final results showed that Random Forest in particular could identify a behaviour indicating that a user is in phase two, while XGBoost performed worse at distinguishing between positive and negative users. However, XGBoost captured more positive users than Random Forest did. To examine the organizational prerequisites for implementing machine learning and AI, literature studies were carried out and the client was interviewed continuously. The most important prerequisites fell into two categories: data infrastructure and how well AI and machine learning are integrated into the organization's culture.
Styles APA, Harvard, Vancouver, ISO, etc.
33

Park, Samuel M. « A Comparison of Machine Learning Techniques to Predict University Rates ». University of Toledo / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1564790014887692.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
34

Fürderer, Niklas. « A Study of an Iterative User-Specific Human Activity Classification Approach ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-253802.

Texte intégral
Résumé :
Applications for sensor-based human activity recognition use the latest algorithms for the detection and classification of everyday human activities, for both online and offline use cases. The insights generated by those algorithms can in a next step be used within a wide range of applications such as safety, fitness tracking, localization, personalized health advice and improved child and elderly care. In order for an algorithm to be performant, a significant amount of annotated data from the specific target audience is required. However, a satisfactory data collection process is cost and labor intensive. It may also be unfeasible for specific target groups, as aging affects motion patterns and behaviors. One main challenge in this application area lies in the ability to identify relevant changes over time while being able to reuse previously annotated user data. The accurate detection of those user-specific patterns and movement behaviors therefore requires individual and adaptive classification models for human activities. The goal of this degree work is to compare the performance of several supervised classifiers when trained and tested with the new iterative user-specific human activity classification approach described in this report. A qualitative and quantitative data collection process was applied. The tree-based classification algorithms Decision Tree, Random Forest and XGBoost were tested on custom datasets divided into three groups. The datasets contained labeled motion data from wrist-worn sensors for 21 volunteers. Computed across all datasets, the average performance measured in recall increased by 5.2% (using a simulated leave-one-subject-out cross-validation) for algorithms trained via the described approach compared to a random non-iterative approach.
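The leave-one-subject-out evaluation used in this thesis holds out all data from one volunteer per fold, so a model is never tested on a person it was trained on. A minimal pure-Python sketch of such a splitter (the subject IDs and toy wrist-sensor records are illustrative, not from the thesis data):

```python
from collections import defaultdict

def leave_one_subject_out(samples):
    """Yield (subject, train, test) splits where each split holds out
    every sample belonging to one subject, as in leave-one-subject-out CV."""
    by_subject = defaultdict(list)
    for s in samples:
        by_subject[s["subject"]].append(s)
    for held_out in by_subject:
        test = by_subject[held_out]
        train = [s for subj, group in by_subject.items()
                 if subj != held_out for s in group]
        yield held_out, train, test

# Toy wrist-sensor records (illustrative only).
data = [
    {"subject": "s1", "x": 0.1, "label": "walk"},
    {"subject": "s1", "x": 0.9, "label": "run"},
    {"subject": "s2", "x": 0.2, "label": "walk"},
    {"subject": "s3", "x": 0.8, "label": "run"},
]

for subj, train, test in leave_one_subject_out(data):
    assert all(s["subject"] != subj for s in train)
    assert all(s["subject"] == subj for s in test)
```

Each fold's test set contains exactly one subject's data, which is what makes the evaluation user-independent.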
Styles APA, Harvard, Vancouver, ISO, etc.
35

Aguilar, Vilca Dennys, et Ramos Julio Cesar Camargo. « Sistema inteligente basado en redes neuronales, máquina de soporte vectorial y random forest para la predicción de deserción de clientes en microcréditos de bancos ». Bachelor's thesis, Universidad Nacional Mayor de San Marcos, 2021. https://hdl.handle.net/20.500.12672/16390.

Texte intégral
Résumé :
Customer attrition is a problem that currently affects companies in every sector and in every country. The financial sector is one of the most important due to the large number of customers and the money they bring in. Companies invest money in tracking customers to identify patterns that may indicate whether a customer is about to stop doing business with them, but manual approaches to this often waste both time and money. In the literature it is common to find models predicting customer attrition in bank microcredit; their weak point is that they apply only a single technique for the prediction itself. In view of this, an intelligent system is proposed based on a hybrid model that combines three techniques to provide better accuracy than that observed in the literature: Support Vector Machines, Neural Networks and Random Forest. The numerical results of the experiment carried out at a Peruvian bank with a dataset of 24,420 customers show an accuracy of 97.38%, which improves on the results in the literature.
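The abstract describes a hybrid of Support Vector Machines, Neural Networks and Random Forest, but does not specify how the three outputs are merged; the majority-vote rule below is therefore only an assumed illustration of a hybrid ensemble:

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine per-model label predictions (one list per model) by majority vote."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

# Hypothetical outputs of the three models for four clients (1 = churn).
svm_pred = [1, 0, 1, 0]
nn_pred  = [1, 1, 1, 0]
rf_pred  = [0, 0, 1, 0]
assert majority_vote(svm_pred, nn_pred, rf_pred) == [1, 0, 1, 0]
```

Other combination rules (weighted voting, stacking a meta-learner on the three outputs) are equally plausible readings of "hybrid model".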
Styles APA, Harvard, Vancouver, ISO, etc.
36

Jacobsson, Marcus, et Viktor Inkapööl. « Prediktion av optimal tidpunkt för köp av flygbiljetter med hjälp av maskininlärning ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281767.

Texte intégral
Résumé :
The work presented in this study is motivated by the desire to cut consumer costs related to the purchase of airfare tickets. In detail, the study has investigated whether it is possible to classify optimal purchase decisions for specific flight routes with high accuracy using machine learning models trained on basic data containing only price and search date for a given date of departure. The models were based on a Random Forest Classifier, trained on search data up to 90 days ahead of every departure date in July 2016-2018, and tested on the same kind of data for 2019. After preparation of the data and tuning of hyperparameters, the final models managed to correctly classify optimal purchases with an accuracy of 88% for the route Stockholm-Mallorca and 84% for the route Stockholm-Bangkok. Based on the assumption that the number of searches correlates with demand and in turn with actual purchases, the study calculated the average expected savings per ticket when using the model on these routes to be 21% and 17% respectively. Furthermore, the study has also examined how a business model for price comparison could be reshaped to incorporate these findings. The framework used was the Business Model Canvas, and the analysis resulted in the recommendation to implement a premium service in which users would be told whether to buy or wait based on a search.
Styles APA, Harvard, Vancouver, ISO, etc.
37

Yang, Kaolee. « A Statistical Analysis of Medical Data for Breast Cancer and Chronic Kidney Disease ». Bowling Green State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1587052897029939.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
38

Paul, Somak. « Effect of Supply Chain Uncertainties on Inventory and Fulfillment Decision Making : An Empirical Investigation ». The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563510590703363.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
39

Fredriksson, Tomas, et Rickard Svensson. « Analysis of machine learning for human motion pattern recognition on embedded devices ». Thesis, KTH, Mekatronik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-246087.

Texte intégral
Résumé :
With an increased amount of connected devices and the recent surge of artificial intelligence, the two technologies need more attention to fully bloom as a useful tool for creating new and exciting products. As machine learning traditionally is implemented on computers and online servers, this thesis explores the possibility of extending machine learning to an embedded environment. This evaluation of existing machine learning in embedded systems with limited processing capabilities has been carried out in the specific context of an application involving classification of basic human movements. Previous research and implementations indicate that it is possible with some limitations; this thesis aims to answer which hardware limitation affects classification and what classification accuracy the system can reach on an embedded device. The tests included human motion data from an existing dataset and covered four different machine learning algorithms on three devices. Support Vector Machines (SVM) were found to perform best compared to CART, Random Forest and AdaBoost. SVM reached a classification accuracy of 84.69% between six different included motions with a classification time of 16.88 ms per classification on a Cortex M4 processor. This is the same classification accuracy as the one obtained on the host computer with more computational capabilities. Other hardware and machine learning algorithm combinations had a slight decrease in classification accuracy and an increase in classification time. Conclusions could be drawn that memory on the embedded device affects which algorithms can be run and the complexity of the data that can be extracted in the form of features, while processing speed mostly affects classification time. Additionally, the performance of the machine learning system is connected to the type of data that is to be observed, which means that the performance of different setups differs depending on the use case.
Styles APA, Harvard, Vancouver, ISO, etc.
40

Granström, Daria, et Johan Abrahamsson. « Loan Default Prediction using Supervised Machine Learning Algorithms ». Thesis, KTH, Matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252312.

Texte intégral
Résumé :
It is essential for a bank to estimate the credit risk it carries and the magnitude of exposure it has in case of non-performing customers. Estimation of this kind of risk has been done by statistical methods through decades and with respect to recent development in the field of machine learning, there has been an interest in investigating if machine learning techniques can perform better quantification of the risk. The aim of this thesis is to examine which method from a chosen set of machine learning techniques exhibits the best performance in default prediction with regards to chosen model evaluation parameters. The investigated techniques were Logistic Regression, Random Forest, Decision Tree, AdaBoost, XGBoost, Artificial Neural Network and Support Vector Machine. An oversampling technique called SMOTE was implemented in order to treat the imbalance between classes for the response variable. The results showed that XGBoost without implementation of SMOTE obtained the best result with respect to the chosen model evaluation metric.
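SMOTE, the oversampling technique mentioned above, creates synthetic minority-class samples by interpolating between an existing sample and one of its nearest minority-class neighbours. A bare-bones one-dimensional sketch of that idea (real SMOTE operates on k-nearest neighbours in the full feature space; the data here are hypothetical):

```python
import random

def smote_1d(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating a random minority
    sample toward one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m != x),
                            key=lambda m: abs(m - x))[:k]
        nb = rng.choice(neighbours)
        synthetic.append(x + rng.random() * (nb - x))
    return synthetic

# Hypothetical minority-class feature values (e.g. a defaulted-loan feature).
minority = [1.0, 1.2, 1.5, 2.0]
new_points = smote_1d(minority, n_new=4)
assert all(min(minority) <= p <= max(minority) for p in new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority data already occupies.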
Styles APA, Harvard, Vancouver, ISO, etc.
41

Choi, Bong-Jin. « Statistical Analysis, Modeling, and Algorithms for Pharmaceutical and Cancer Systems ». Scholar Commons, 2014. https://scholarcommons.usf.edu/etd/5200.

Texte intégral
Résumé :
The aim of the present study is to develop statistical algorithms and models associated with breast and lung cancer patients. In this study, we developed several statistical software tools, R packages, and models using our new statistical approach. We used the five-parameter logistic model for determining the optimal doses of pharmaceutical drugs, including dynamic initial points, an automatic process for outlier detection, and an algorithm driving a graphical user interface (GUI) program. The developed statistical procedure assists medical scientists by reducing the time needed to determine the optimal dose of new drugs, and can also easily identify which drugs need more experimentation. Secondly, we developed a new classification method that is very useful in the health sciences. We used a new decision tree algorithm and a random forest method to rank our variables and to build a final decision tree model. The decision tree can identify and communicate complex data systems to scientists with minimal knowledge of statistics. Thirdly, we developed statistical packages using the Johnson SB probability distribution, which is important in parametrically studying a variety of health, environmental, and engineering problems. Scientists experience difficulties in obtaining estimates for the four parameters of this probability distribution. The developed algorithm combines several statistical procedures, such as the Newton-Raphson, bisection, least squares estimation, and regression methods, in our R package. This R package has functions that generate random numbers, calculate probabilities and inverse probabilities, and estimate the four parameters of the Johnson SB probability distribution. Researchers can use the developed R package to build their own statistical models or perform desirable statistical simulations.
The final aspect of the study involves building a statistical model for lung cancer survival time. In developing this statistical model, we have taken into consideration the number of cigarettes the patient smoked per day, the duration of smoking, and the age at diagnosis of lung cancer. The response variable is the survival time, and the significant factors include an interaction term. The probability density function of the survival times has been obtained and the survival function determined. The analysis is based on four groups that involve gender and smoking factors, and a comparison with the ordinary survival function is given.
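The five-parameter logistic (5PL) dose-response model referred to above has a standard closed form, f(x) = d + (a - d) / (1 + (x/c)^b)^g; a direct implementation (the parameter values in the checks are illustrative, not the study's estimates):

```python
def logistic_5pl(x, a, b, c, d, g):
    """Five-parameter logistic dose-response curve:
    a = response at zero dose, d = response at infinite dose,
    c = inflection dose, b = slope, g = asymmetry factor."""
    return d + (a - d) / (1.0 + (x / c) ** b) ** g

# With a symmetric curve (g = 1), the response at dose c is exactly halfway
# between the two asymptotes.
assert logistic_5pl(10, a=100, b=2, c=10, d=0, g=1) == 50.0
# Response decreases with dose for these parameters.
assert logistic_5pl(5, a=100, b=2, c=10, d=0, g=1) > logistic_5pl(20, a=100, b=2, c=10, d=0, g=1)
```

The asymmetry parameter g is what distinguishes the 5PL from the common four-parameter logistic (g = 1), letting the curve approach its two asymptotes at different rates.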
Styles APA, Harvard, Vancouver, ISO, etc.
42

Малік, Тимур Імтіазович. « Статистична модель прогнозування вартості автомобіля за даними автомобільного ринку України ». Bachelor's thesis, КПІ ім. Ігоря Сікорського, 2020. https://ela.kpi.ua/handle/123456789/37545.

Texte intégral
Résumé :
The bachelor's thesis consists of: 117 p., 13 tables, 29 fig., 2 appendices and 46 references. The object of the study is a sample of data from the secondary automotive market of Ukraine for 2020. The subject of the research is data mining methods based on regression using decision trees. Python was selected as the programming language. The aim of the work is to determine the best model for forecasting the price of a car using data from the secondary car market of Ukraine. The work studies the application of decision trees, and of various methods based on them, to this forecasting problem using existing data for 2020. The main factors that affect the price are highlighted. In the course of the study, it was found that the random forest method gives good results on the studied data. It is planned to develop the work in the direction of researching the application of this method in order to further reduce the forecasting error in various tasks related to forecasting the prices not only of cars but also of other vehicles.
Styles APA, Harvard, Vancouver, ISO, etc.
43

Straková, Kristýna. « Datamining a využití rozhodovacích stromů při tvorbě Scorecards ». Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-201627.

Texte intégral
Résumé :
The thesis presents a comparison of several selected modeling methods used by financial institutions for (not exclusively) decision-making processes. The first, theoretical part describes well-known modeling methods such as logistic regression, decision trees, neural networks, alternating decision trees and the relatively new method called "Random Forest". The practical part of the thesis outlines some processes within financial institutions in which the selected modeling methods are used. On real data from two financial institutions, logistic regression, decision trees and decision forests are compared with each other. The neural network method is not included due to its limited interpretability. In conclusion, based on the resulting models, the thesis tries to answer whether logistic regression (the method most widely used by financial institutions) remains the most suitable.
Styles APA, Harvard, Vancouver, ISO, etc.
44

Кичигіна, Анастасія Юріївна. « Прогнозування ІМТ за допомогою методів машинного навчання ». Bachelor's thesis, КПІ ім. Ігоря Сікорського, 2020. https://ela.kpi.ua/handle/123456789/37413.

Texte intégral
Résumé :
Thesis: 100 p., 17 tables, 16 fig., 2 appendices and 24 references. The object of the study is the human body mass index. The subject of the research is machine learning methods: regression models, the random forest ensemble model and a neural network. This paper studies the dependence of the human body mass index, and of the presence of excess body weight, on eating and living habits. Machine learning and data analysis methods were used to carry out the study; work was done to identify opportunities to improve the performance of standard models, and the best model for prediction and classification on the given data was identified. The direction of the work lies in reducing the dimensionality of the feature space, selecting the best observations with valid data for better model performance, and combining different learning methods to obtain more effective ensemble models.
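The target variable, the body mass index, is weight divided by squared height; a small helper with the conventional categories (the cut-offs are standard WHO values, not taken from the thesis):

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Standard WHO cut-offs (assumption: the thesis may bin BMI differently)."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"

assert round(bmi(70, 1.75), 1) == 22.9
assert bmi_category(bmi(70, 1.75)) == "normal"
```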
Styles APA, Harvard, Vancouver, ISO, etc.
45

Ekeberg, Lukas, et Alexander Fahnehjelm. « Maskininlärning som verktyg för att extrahera information om attribut kring bostadsannonser i syfte att maximera försäljningspris ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-240401.

Texte intégral
Résumé :
The Swedish real estate market has been digitalized over the past decade, with the current practice being to post one's real estate advertisement online. A question that has arisen is how a seller can optimize their public listing to maximize the selling premium. This paper analyzes the use of three machine learning methods to solve this problem: Linear Regression, Decision Tree Regressor and Random Forest Regressor. The aim is to retrieve information regarding how certain attributes contribute to the premium value. The dataset used contains apartments sold within the years 2014-2018 in the Östermalm / Djurgården district in Stockholm, Sweden. The resulting models returned an R²-value of approx. 0.26 and a Mean Absolute Error of approx. 0.06. While the models were not accurate at predicting the premium, information could still be extracted from them. In conclusion, a high number of views and a publication made in April provide the best conditions for an advertisement to reach a high selling premium. The seller should try to keep the number of days since publication below 15.5 and avoid publishing on a Tuesday.
Styles APA, Harvard, Vancouver, ISO, etc.
46

Consuegra, Rengifo Nathan Adolfo. « Detection and Classification of Anomalies in Road Traffic using Spark Streaming ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-238733.

Texte intégral
Résumé :
Road traffic control has been around for a long time to guarantee the safety of vehicles and pedestrians. However, anomalies such as accidents or natural disasters cannot be avoided. Therefore, it is important to be prepared as soon as possible to prevent a higher number of human losses. Nevertheless, there is no system accurate enough to detect and classify road-traffic anomalies in real time. To solve this issue, the following study proposes the training of a machine learning model for detection and classification of anomalies on the highways of Stockholm. Due to the lack of a labeled dataset, the first phase of the work is to detect the different kinds of outliers that can be found and manually label them based on the results of a data exploration study. Datasets containing information regarding accidents and weather are also included to further expand the amount of anomalies. All experiments use real-world datasets coming either from the sensors located on the highways of Stockholm or from official accident and weather reports. Then, three models (Decision Trees, Random Forest and Logistic Regression) are trained to detect and classify the outliers. The design of an Apache Spark Streaming application that uses the model with the best results is also provided. The outcomes indicate that Logistic Regression is better than the rest but still suffers from the imbalanced nature of the dataset. In the future, this project can be used not only to contribute to future research on similar topics but also to monitor the highways of Stockholm.
Styles APA, Harvard, Vancouver, ISO, etc.
47

Helle, Valeria, Andra-Stefania Negus et Jakob Nyberg. « Improving armed conflict prediction using machine learning : ViEWS+ ». Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-354845.

Texte intégral
Résumé :
Our project, ViEWS+, expands the software functionality of the Violence EarlyWarning System (ViEWS). ViEWS aims to predict the probabilities of armed conflicts in the next 36 months using machine learning. Governments and policy-makers may use conflict predictions to decide where to deliver aid and resources, potentially saving lives. The predictions use conflict data gathered by ViEWS, which includes variables like past conflicts, child mortality and urban density. The large number of variables raises the need for a selection tool to remove those that are irrelevant for conflict prediction. Before our work, the stakeholders used their experience and some guesswork to pick the variables, and the predictive function with its parameters. Our goals were to improve the efficiency, in terms of speed, and correctness of the ViEWS predictions. Three steps were taken. Firstly, we made an automatic variable selection tool. This helps researchers use fewer, more relevant variables, to save time and resources. Secondly, we compared prediction functions, and identified the best for the purpose of predicting conflict. Lastly, we tested how parameter values affect the performance of the chosen functions, so as to produce good predictions but also reduce the execution time. The new tools improved both the execution time and the predictive correctness of the system compared to the results obtained prior to our project. It is now nine times faster than before, and its correctness has improved by a factor of three. We believe our work leads to more accurate conflict predictions, and as ViEWS has strong connections to the European Union, we hope that decision makers can benefit from it when trying to prevent conflicts.
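The abstract describes an automatic variable-selection tool for ViEWS. A common filter-style approach, shown here purely as an illustration of the idea (not the ViEWS+ implementation), ranks variables by absolute Pearson correlation with the target and keeps the top k:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def select_top_k(variables, target, k):
    """Keep the k variable names most correlated (in absolute value) with the target."""
    ranked = sorted(variables,
                    key=lambda name: -abs(pearson(variables[name], target)))
    return ranked[:k]

# Hypothetical ViEWS-style variables for five observations (1 = conflict).
target = [0, 1, 0, 1, 1]
variables = {
    "past_conflict": [0, 1, 0, 1, 1],  # perfectly aligned with the target
    "noise":         [1, 0, 1, 1, 0],  # weakly (negatively) related
}
assert select_top_k(variables, target, k=1) == ["past_conflict"]
```

Filter methods like this are fast but ignore interactions between variables; wrapper or embedded methods (e.g. importances from the prediction model itself) are the usual alternatives.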
APA, Harvard, Vancouver, ISO, etc. styles
48

Fernandez, Sanchez Javier. « Knowledge Discovery and Data Mining Using Demographic and Clinical Data to Diagnose Heart Disease ». Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233978.

Full text
Abstract:
Cardiovascular disease (CVD) is the leading cause of morbidity, mortality, premature death and reduced quality of life for the citizens of the EU. It has been reported that CVD represents a major economic burden on health care systems in terms of hospitalizations, rehabilitation services, physician visits and medication. Data mining techniques applied to clinical data have become an interesting tool to prevent, diagnose or treat CVD. In this thesis, Knowledge Discovery and Data Mining (KDD) was employed to analyse clinical and demographic data that could be used to diagnose coronary artery disease (CAD). The exploratory data analysis (EDA) showed that female patients at an elderly age with a higher level of cholesterol, maximum achieved heart rate and ST-depression are more prone to be diagnosed with heart disease. Furthermore, patients with atypical angina are more likely to be at an elderly age with a slightly higher level of cholesterol and maximum achieved heart rate than patients with asymptomatic chest pain. Moreover, patients with exercise-induced angina exhibited lower values of maximum achieved heart rate than those who do not experience it. We could verify that patients who experience exercise-induced angina and asymptomatic chest pain are more likely to be diagnosed with heart disease. In addition, Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Tree, Bagging and Boosting methods were evaluated by adopting a stratified 10-fold cross-validation approach. The learning models provided an average F-score of 78-83% and a mean AUC of 85-88%. Among all the models, the highest score was given by the Radial Basis Function kernel Support Vector Machine (RBF-SVM), achieving an F-score of 82.5% ± 4.7% and an AUC of 87.6% ± 5.8%. Our research confirmed that data mining techniques can support physicians in their interpretation of heart disease diagnosis based on clinical and demographic characteristics of patients.
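The evaluation protocol the abstract describes, stratified 10-fold cross-validation of an RBF-kernel SVM scored with F-score and AUC, can be sketched as below. This is a hedged illustration on synthetic data, not the thesis code; the sample size and feature count are stand-ins for the clinical dataset.

```python
# Sketch: stratified 10-fold CV of an RBF-SVM, reporting mean ± std
# of F1 and ROC AUC, mirroring the protocol described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the clinical/demographic feature matrix.
X, y = make_classification(n_samples=300, n_features=13, random_state=1)

# Scaling matters for RBF kernels, so it goes inside the CV pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=1))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(model, X, y, cv=cv, scoring=("f1", "roc_auc"))

print("F1:  %.3f ± %.3f" % (scores["test_f1"].mean(),
                            scores["test_f1"].std()))
print("AUC: %.3f ± %.3f" % (scores["test_roc_auc"].mean(),
                            scores["test_roc_auc"].std()))
```

Putting the scaler inside the pipeline ensures it is refit on each training fold, avoiding leakage from the held-out fold into the preprocessing step.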
APA, Harvard, Vancouver, ISO, etc. styles
49

Elkin, Colin P. « Development of Adaptive Computational Algorithms for Manned and Unmanned Flight Safety ». University of Toledo / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1544640516618623.

Full text
APA, Harvard, Vancouver, ISO, etc. styles
50

Madrigali, Andrea. « Analysis of Local Search Methods for 3D Data ». Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016.

Find full text
Abstract:
This thesis analyses several search methods for 3D data. It gives a general overview of the field of computer vision, of the state of the art of acquisition sensors, and of some of the formats used to describe 3D data. It then examines 3D object recognition in depth: besides describing the entire matching process between local features, it focuses on the detection phase of salient points. In particular, a learned keypoint detector, based on machine-learning techniques, is analysed. The detector is illustrated with the implementation of two neighbour-search algorithms: an exhaustive one (k-d tree) and an approximate one (radial search). Finally, experimental evaluations are reported in terms of efficiency and speed of the detector implemented with the different search methods, showing an actual performance improvement without a considerable loss of accuracy with the approximate search.
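The exact-versus-approximate trade-off the abstract evaluates can be sketched with a k-d tree over a 3D point cloud. This is an illustrative sketch using SciPy on random points, not the thesis implementation; with `eps > 0` the returned neighbour's distance is guaranteed to be within a factor `(1 + eps)` of the true nearest-neighbour distance.

```python
# Sketch: exact vs. approximate nearest-neighbour queries on a 3D
# point cloud with a k-d tree; eps > 0 trades bounded error for speed.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
cloud = rng.random((10000, 3))    # stand-in for a 3D point cloud
queries = rng.random((100, 3))    # stand-in for detected keypoints

tree = cKDTree(cloud)

# Exact query: the true nearest neighbour of each query point.
d_exact, i_exact = tree.query(queries)

# Approximate query: distance is within (1 + eps) of the optimum.
eps = 0.5
d_approx, i_approx = tree.query(queries, eps=eps)

print("max relative error:", float(np.max(d_approx / d_exact - 1)))
```

In practice the measured error is usually far below the `(1 + eps)` bound, which is why the approximate search can be much faster with little loss of accuracy.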
APA, Harvard, Vancouver, ISO, etc. styles