Dissertations on the topic "Modèle « Random Forest »"

To view other types of publications on this topic, follow the link: Modèle « Random Forest ».

Browse the top 50 dissertations for research on the topic "Modèle « Random Forest »".

Next to each entry in the reference list there is an "Add to bibliography" button. Click it and we will automatically generate the bibliographic reference for the selected work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the publication as a .pdf file and read its abstract online, when these are available in the metadata.

Browse dissertations from a wide range of disciplines and compile your bibliography correctly.

1

Mita, Mara. "Assessment of seismic displacements of existing landslides through numerical modelling and simplified methods." Electronic Thesis or Diss., Université Gustave Eiffel, 2023. http://www.theses.fr/2023UEFL2075.

Full text of the source
Abstract:
Landslides are common secondary effects of earthquakes and can be responsible for greater damage than the ground shaking alone. Predicting these phenomena is therefore essential for risk management in seismic regions. Nowadays, permanent co-seismic landslide displacements are assessed by the traditional "rigid sliding block" method proposed by Newmark (1965). Despite its limitations, this method has two advantages: i) relatively short computation times, and ii) compatibility with GIS software for regional-scale analyses. Alternatively, more complex numerical analyses can be performed to simulate seismic wave propagation into slopes and the related effects. However, due to their longer computation times, their use is usually limited to slope-scale analyses. This study aims at better understanding under which conditions (i.e. combinations of the relevant parameters introduced) analytical and numerical methods predict different earthquake-induced landslide displacements. To this end, 216 2D landslide prototypes were designed by combining geometrical and geotechnical parameters inferred from a statistical analysis of data collected through a literature review. The landslide prototypes were subjected to 17 seismic signals with constant Arias Intensity (AI ~ 0.1 m/s) and variable mean period. The results allowed a preliminary Random Forest model to be defined that predicts a priori the expected difference between the displacements given by the two methods. Analysis of the results allowed: i) identifying the parameters that control displacements in the two methods, and ii) concluding that, for the AI level considered here, the differences between computed displacements are negligible in most cases.
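
As background for the comparison described above, the sketch below shows the classical Newmark (1965) rigid-block calculation in Python. It is not taken from the thesis; the acceleration record and the critical acceleration are placeholder values chosen for illustration.

```python
import numpy as np

def newmark_displacement(accel, dt, a_crit):
    """Newmark (1965) rigid-block displacement by double integration.

    accel  : ground acceleration time series (m/s^2)
    dt     : time step (s)
    a_crit : critical (yield) acceleration of the sliding mass (m/s^2)
    """
    vel = 0.0
    disp = 0.0
    for a in accel:
        if vel > 0.0 or a > a_crit:
            # Block slides: it accelerates by the excess over a_crit, and
            # sliding stops once the relative velocity returns to zero.
            vel = max(vel + (a - a_crit) * dt, 0.0)
            disp += vel * dt
        # Otherwise the block moves with the ground (no relative motion).
    return disp

# Synthetic example: a 10 s harmonic record, a_crit = 0.5 m/s^2 (placeholders).
t = np.arange(0.0, 10.0, 0.01)
accel = 2.0 * np.sin(2.0 * np.pi * 1.0 * t)
print(f"Newmark displacement: {newmark_displacement(accel, 0.01, 0.5):.3f} m")
```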
2

Walschaerts, Marie. "La santé reproductive de l'homme : méthodologie et statistique." Toulouse 3, 2011. http://thesesups.ups-tlse.fr/1470/.

Full text of the source
Abstract:
Male reproductive health is an indicator of a man's overall health. It is also closely linked to environmental exposures and living habits. Surveillance of male fertility shows a secular decline in sperm quality and an increase in diseases and malformations of the male reproductive tract. The objective of this work is to study male reproductive health from an epidemiological perspective and through various statistical tools. Initially, we were interested in testicular cancer, its incidence and its risk factors. Then, we studied a population of men consulting for male infertility, their andrological examinations, their therapeutic care and the outcome of their parenthood project. Finally, the birth event was analyzed through survival models accounting for right-censoring: the Cox model and survival trees. We compared different stable variable-selection methods (bootstrapped stepwise selection and L1-penalized selection based on the Cox model, as well as the bootstrap node-level stabilization method and random survival forests) in order to obtain a final model that is easy to interpret and predicts well. In the south of France, the incidence of testicular cancer has doubled over the past 20 years. The birth cohort effect, i.e. the generational effect, suggests a deleterious effect of environmental exposures on male reproductive health. However, no exposure in a man's living environment during his adult life appears to be a potential risk factor for testicular cancer, suggesting the hypothesis of exposure to endocrine disruptors in utero. Male factors account for 50% of cases of infertility, making the management of male infertility essential. In our cohort, 85% of male partners presented an abnormal clinical examination (either a medical history or an anomaly found during the andrological examination). Finally, one in two couples who consulted for male infertility successfully had a child, and male age over 35 appears to be a major risk factor, which should encourage couples to start their parenthood project earlier. When the time to the reproductive outcome of these infertile couples is taken into account, the inclusion of large numbers of covariates often yields unstable survival models. We therefore combined bootstrap resampling with variable-selection approaches. Although random survival forests give the best predictive performance, their results are not easily interpretable. For the other methods, the results differ according to the sample size. The stepwise algorithm based on the Cox model does not converge when the number of events is too small. The bootstrap node-level stabilization method does not seem to predict better than a simple survival tree when the sample is sufficiently large. Finally, the Cox model with L1-penalized variable selection offers a good compromise between interpretability and prediction for small samples.
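
To make the L1-penalized Cox selection concrete, here is a minimal sketch using the lifelines library on synthetic right-censored data. The data, the penalty strength and the column names are assumptions for illustration, not the thesis's cohort.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Toy survival data (placeholder): time to the birth event, right-censored.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(5)])
df["time"] = rng.exponential(scale=np.exp(0.5 * df["x0"]))  # only x0 is informative
df["event"] = rng.random(n) < 0.7                           # ~30% right-censoring

# Cox model with an L1 (lasso-type) penalty: sparse, interpretable coefficients.
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="time", event_col="event")
print(cph.params_)  # most coefficients are shrunk to (near) zero
```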
3

Asritha, Kotha Sri Lakshmi Kamakshi. "Comparing Random forest and Kriging Methods for Surrogate Modeling." Thesis, Blekinge Tekniska Högskola, Fakulteten för datavetenskaper, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-20230.

Full text of the source
Abstract:
The issue with conducting real experiments in design engineering is the cost of finding an optimal design that fulfills all design requirements and constraints. An alternative to real experiments is computer-aided design modeling combined with computer-simulated experiments. These simulations are conducted to understand functional behavior and to predict possible failure modes in design concepts. However, such simulations may take minutes, hours or days to finish. In order to reduce the time and the number of simulations required for design space exploration, surrogate modeling is used. The motive of surrogate modeling is to replace the original system with an approximation of the simulation function that can be computed quickly. The process of surrogate model generation includes sample selection, model generation, and model evaluation. Using surrogate models in design engineering can help reduce design cycle times and cost by enabling rapid analysis of alternative designs. Selecting a suitable surrogate modeling method for a given function with specific requirements is possible by comparing different surrogate modeling methods. These methods can be compared using different application problems and evaluation metrics. In this thesis, we compare the random forest model and the kriging model based on prediction accuracy. The comparison is performed using mathematical test functions, and quantitative experiments were conducted to investigate the performance of the methods. The experimental analysis found that the kriging models have higher accuracy than the random forest models, while the random forest models have shorter execution times than kriging for the studied mathematical test problems.
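
A minimal sketch of such a comparison, assuming scikit-learn's GaussianProcessRegressor as a stand-in for kriging and an arbitrary smooth test function; the function, sample sizes and kernel are illustrative assumptions, not those of the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import mean_squared_error

# Test function standing in for an expensive simulation (assumed: 2D, smooth).
def f(X):
    return np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.5 * X[:, 0]

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(60, 2))   # small design-of-experiments sample
X_test = rng.uniform(-3, 3, size=(500, 2))
y_train, y_test = f(X_train), f(X_test)

rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_train, y_train)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_train, y_train)

for name, model in [("random forest", rf), ("kriging (GP)", gp)]:
    print(name, "test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```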
4

Pettersson, Anders. "High-Dimensional Classification Models with Applications to Email Targeting." Thesis, KTH, Matematisk statistik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-168203.

Full text of the source
Abstract:
Email communication is valuable for any modern company, since it offers an easy means of spreading important information, advertising new products, features or offers, and much more. Being able to identify which customers would be interested in certain information would make it possible to significantly improve a company's email communication, avoiding customers starting to ignore messages and the creation of unnecessary badwill. This thesis focuses on targeting customers by applying statistical learning methods to historical data provided by the music streaming company Spotify. An important aspect was the high dimensionality of the data, which places certain demands on the applied methods. A binary classification model was created, where the target was whether a customer would open the email or not. Two approaches were used for targeting the customers: logistic regression, both with and without regularization, and a random forest classifier, chosen for their ability to handle high-dimensional data. The predictive accuracy of the suggested models was then evaluated on both a training set and a test set using statistical validation methods such as cross-validation, ROC curves and lift charts. The models were studied under both large-sample and high-dimensional scenarios. The high-dimensional scenario represents the case where the number of observations, N, is of the same order as the number of features, p, while the large-sample scenario represents N ≫ p. Lasso-based variable selection was performed for both scenarios, to study the informative value of the features. This study demonstrates that it is possible to greatly improve the opening rate of emails by targeting users, even in the high-dimensional scenario. The results show that increasing the amount of training data over a thousandfold only improves the performance marginally. Rather, efficient customer targeting can be achieved by using a few highly informative variables selected by Lasso regularization.
5

Henriksson, Erik, and Kristopher Werlinder. "Housing Price Prediction over Countrywide Data : A comparison of XGBoost and Random Forest regressor models." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-302535.

Full text of the source
Abstract:
The aim of this research project is to investigate how an XGBoost regressor compares to a Random Forest regressor in predicting housing prices, with the help of two data sets. The comparison considers training time, inference time and the three evaluation metrics R2, RMSE and MAPE. The data sets are described in detail together with background on the regressor models used. The method involves substantial cleaning of the two data sets, hyperparameter tuning to find optimal parameters, and 5-fold cross-validation in order to achieve good performance estimates. The finding of this research project is that XGBoost performs better on both small and large data sets. While the Random Forest model can achieve results similar to the XGBoost model, it needs a much longer training time, between 2 and 50 times as long, and has a longer inference time, around 40 times as long. This makes XGBoost especially superior when used on larger sets of data.
6

Hawkins, Susan. "The stability of host-pathogen multi-strain models." Thesis, University of Oxford, 2017. http://ora.ox.ac.uk/objects/uuid:c324b259-57ee-4cc4-b68c-21b4d98414da.

Full text of the source
Abstract:
Previous multi-strain mathematical models have elucidated that the degree of cross-protective responses between similar strains, acting as a form of immune selection, generates different behavioural states of the pathogen population. This thesis explores these multi-strain dynamic states, to examine their robustness and stability in the face of pathogenic intrinsic phenotypic variation, and the extrinsic force of immune selection. This is achieved in two main ways: Chapter 2 introduces phenotypic variation in pathogen transmissibility, testing the robustness of a stable pathogen population to the emergence of an introduced strain of higher transmission potential; and Chapter 3 introduces a new model with a possibility of immunity to both strain-specific and cross-strain (conserved) determinants, to investigate how heterogeneity in the specificity of a host immune response alters the pathogen population structure. A final investigation in Chapter 4 develops a method of reverse-pattern oriented modelling using a machine learning algorithm to determine which intrinsic properties of the pathogen, and their combinations, lead to particular disease-like population patterns. This research offers novel techniques to complement previous and ongoing work on multi-strain modelling, with direct applications to a range of infectious agents such as Plasmodium falciparum, influenza A, and rotavirus, but also with a wider potential for other multi-strain systems.
7

Ferrat, L. "Machine learning and statistical analysis of complex mathematical models : an application to epilepsy." Thesis, University of Exeter, 2019. http://hdl.handle.net/10871/36090.

Full text of the source
Abstract:
The electroencephalogram (EEG) is a commonly used tool for studying the emergent electrical rhythms of the brain. It has wide utility in psychology, as well as providing a useful diagnostic aid for neurological conditions such as epilepsy. It is of growing importance to better understand the emergence of these electrical rhythms and, in the case of diagnosis of neurological conditions, to find mechanistic differences between healthy individuals and those with a disease. Mathematical models are an important tool that offer the potential to reveal these otherwise hidden mechanisms. In particular Neural Mass Models (NMMs), which describe the macroscopic activity of large populations of neurons, are increasingly used to uncover large-scale mechanisms of brain rhythms in both health and disease. The dynamics of these models depend upon the choice of parameters, and therefore it is crucial to understand how the dynamics change when parameters are varied. Although NMMs are considered low-dimensional in comparison to micro-scale neural network models, with regard to understanding the relationship between parameters and dynamics they are still prohibitively high-dimensional for classical approaches such as numerical continuation. We need alternative methods to characterise the dynamics of NMMs in high-dimensional parameter spaces. The primary aim of this thesis is to develop a method to explore and analyse the high-dimensional parameter space of these mathematical models. We develop an approach based on statistics and machine learning methods called decision tree mapping (DTM). This method analyses the parameter space of a mathematical model by studying all the parameters simultaneously, so that the parameter space can be mapped efficiently in high dimension. Measures linked with this method are used to determine which parameters play a key role in the output of the model. The approach recursively splits the parameter space into smaller subspaces with increasing homogeneity of dynamics. The concepts of decision tree learning, random forests, measures of importance, statistical tests and visual tools are introduced to explore and analyse the parameter space, and we formally introduce the theoretical background and the methods with examples. The DTM approach is used in three distinct studies to:
• identify the role of parameters in the model dynamics (for example, which parameters have a role in the emergence of seizure dynamics?);
• constrain the parameter space, such that regions of the parameter space which give implausible dynamics are removed;
• compare parameter sets fitted to different groups (how does the thalamocortical connectivity of people with and without epilepsy differ?).
We demonstrate that classical studies have not taken into account the complexity of the parameter space. DTM can easily be extended to other fields using mathematical models. We advocate the use of this method in the future to constrain high-dimensional parameter spaces in order to enable more efficient, person-specific model calibration.
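
To illustrate the flavour of such a decision-tree mapping, here is a toy sketch: parameter vectors are sampled, labelled by a made-up "dynamics" rule standing in for model simulations, and a shallow tree plus a random forest recover the splits and parameter importances. All names and the labelling rule are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for a neural mass model: sample a parameter space and label
# each point by the dynamics it produces (here a fabricated rule).
rng = np.random.default_rng(0)
names = [f"p{i}" for i in range(8)]
P = rng.uniform(0, 1, size=(5000, 8))
seizure_like = (P[:, 0] > 0.7) & (P[:, 3] < 0.4)   # placeholder "dynamics" label

# A shallow tree gives readable, recursive splits of the parameter space.
tree = DecisionTreeClassifier(max_depth=3).fit(P, seizure_like)
print(export_text(tree, feature_names=names))

# A forest's importances rank which parameters control the dynamics.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(P, seizure_like)
print(dict(zip(names, forest.feature_importances_.round(3))))
```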
8

Castillo, Beldaño Ana Isabel. "Modelo de fuga y políticas de retención en una empresa de mejoramiento del hogar." Tesis, Universidad de Chile, 2014. http://repositorio.uchile.cl/handle/2250/130827.

Full text of the source
Abstract:
Thesis submitted to qualify for the degree of Industrial Civil Engineer
The dynamism shown by the home improvement industry in recent times has forced the companies involved to understand the purchasing behaviour of their consumers, since they must focus their resources and strategies not only on capturing new customers but also on retaining them. The goal of this work is to estimate customer churn in a home improvement company in order to design retention strategies. To that end, churn criteria are defined and probabilities are estimated so that actions can be targeted at the fraction of customers most likely to leave. To achieve these goals, only customers belonging to a salesperson's portfolio are considered, and the following tools are used: descriptive statistics, the RFM technique, and a comparison of the predictive models decision tree and Random Forest, whose main difference is the number of variables and trees built to predict the churn probabilities. The results yield three churn criteria, so that a customer is labelled as churned when any of the maximum bounds is exceeded, that is, 180 days of recency, 20 for the R/F ratio, or a drop in spend of more than 80%; the sample is thus composed of 53.9% churned customers versus 46.1% active customers. Regarding the predictive models, the decision tree delivers better accuracy, 84.1% versus 74.7% for the Random Forest, so the former was chosen; its churn probabilities yield 4 customer types: Loyal (37.9%), Normal (7.8%), Likely to churn (15.6%) and Churned (38.7%). The causes of churn correspond to long periods of inactivity, delays in purchase cycles and a decrease in transaction amounts and counts, as well as an increase in negative transaction amounts attributable to returns and credit notes; the main retention actions would therefore be promotions, a loyalty club, personalised discounts, and better management of deliveries and stock levels so that customers purchase again sooner. Finally, this work concludes that retaining the 5% of customers with churn probabilities between [0.5 and 0.75] and with the top 50% of transaction amounts yields revenues of USD $205 thousand over 6 months, representing 5.5% of the customers. It is proposed to validate this work on new customers, to run a satisfaction survey and to improve salesperson performance through portfolio optimisation.
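
A toy sketch of the churn labelling and the model comparison, using the three churn bounds quoted above; the customer table and model settings are fabricated placeholders, not the company's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Fabricated customer table with RFM-style features.
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "recency_days": rng.integers(1, 400, n),     # days since last purchase
    "rf_ratio": rng.uniform(0, 40, n),           # recency / frequency
    "amount_change": rng.uniform(-1.0, 1.0, n),  # relative change in spend
})
# Churn rule from the thesis: recency > 180 days, R/F > 20, or spend drop < -80%.
df["churned"] = ((df.recency_days > 180) | (df.rf_ratio > 20)
                 | (df.amount_change < -0.8))

X_tr, X_te, y_tr, y_te = train_test_split(df.drop(columns="churned"), df["churned"])
for model in [DecisionTreeClassifier(max_depth=4), RandomForestClassifier(200)]:
    print(type(model).__name__, "accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))
```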
9

Teang, Kanha, and Yiran Lu. "Property Valuation by Machine Learning and Hedonic Pricing Models : A Case study on Swedish Residential Property." Thesis, KTH, Fastigheter och byggande, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-298307.

Full text of the source
Abstract:
Property valuation is a critical concept for a variety of applications in the real estate market, such as transactions, taxes, investments and mortgages. However, there is little consensus on which method is best for estimating property value. This paper aims at investigating and comparing the differences in Stockholm residential property valuation results among parametric hedonic pricing models (HPM), including linear and log-linear regression models, and Random Forest (RF) as the machine learning algorithm. The data consist of 114,293 arm's-length transactions of tenant-owned apartments between January 2005 and December 2014. The same variables are applied to both the HPM regression models and RF. Two techniques are adopted for splitting the data into training and testing datasets: random splitting, and splitting based on the transaction years. These datasets are used to train and test all the models. The performance of each model is evaluated and measured with four indicators: R-squared, MSE, RMSE and MAPE. The results from both data-splitting circumstances show that the accuracy of the random forest is the highest among the compared models. The discussion points out the causes of the models' performance changes when applied to the different datasets obtained from the two data-splitting techniques. Limitations are also pointed out at the end of the study for future improvements.
10

Ramosaj, Burim [Verfasser], Markus [Akademischer Betreuer] Pauly, and Jörg [Gutachter] Rahnenführer. "Analyzing consistency and statistical inference in Random Forest models / Burim Ramosaj ; Gutachter: Jörg Rahnenführer ; Betreuer: Markus Pauly." Dortmund : Universitätsbibliothek Dortmund, 2020. http://d-nb.info/1218781378/34.

Full text of the source
11

Kalmár, Marcus, and Joel Nilsson. "The art of forecasting – an analysis of predictive precision of machine learning models." Thesis, Uppsala universitet, Statistiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-280675.

Full text of the source
Abstract:
Forecasting is used for decision making, and unreliable predictions can instill a false sense of confidence. Traditional time series modelling is a statistical art form rather than a science, and errors can occur due to limitations of human judgment. In minimizing the risk of falsely specifying a process, the practitioner can make use of machine learning models. In an effort to find out if there is a benefit in using models that require less human judgment, the machine learning models Random Forest and Neural Network have been used to model a VAR(1) time series. In addition, the classical time series models AR(1), AR(2), VAR(1) and VAR(2) have been used as a comparative foundation. The Random Forest and Neural Network are trained and ultimately the models are used to make predictions evaluated by RMSE. All models yield scattered forecast results except for the Random Forest, which steadily yields comparatively precise predictions. The study shows that there is a definitive benefit in using Random Forests to eliminate the risk of falsely specifying a process, and that they in fact provide better results than a correctly specified model.
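
A minimal sketch of the experiment's core idea, assuming placeholder VAR(1) coefficients: simulate the process, fit a correctly specified VAR(1) with statsmodels, fit a random forest on lagged values, and compare one-step-ahead RMSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from statsmodels.tsa.api import VAR

# Simulate a bivariate VAR(1) process (placeholder coefficient matrix A).
rng = np.random.default_rng(0)
A = np.array([[0.6, 0.2], [-0.1, 0.5]])
y = np.zeros((500, 2))
for t in range(1, 500):
    y[t] = A @ y[t - 1] + rng.normal(0, 0.5, 2)

train, test = y[:400], y[400:]

# Classical, correctly specified VAR(1).
var_res = VAR(train).fit(1)

# Random forest: regress y_t on y_{t-1}, no model specification required.
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(train[:-1], train[1:])

var_pred = np.array([var_res.forecast(test[t - 1:t], 1)[0] for t in range(1, 100)])
rf_pred = rf.predict(test[:-1])
for name, pred in [("VAR(1)", var_pred), ("random forest", rf_pred)]:
    print(name, "one-step RMSE:", np.sqrt(np.mean((test[1:] - pred) ** 2)).round(3))
```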
12

Wu, Shuang. "Algebraic area distribution of two-dimensional random walks and the Hofstadter model." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS459/document.

Full text of the source
Abstract:
This thesis is about the Hofstadter model, i.e., a single electron moving on a two-dimensional lattice coupled to a perpendicular homogeneous magnetic field. Its spectrum is one of the famous fractals in quantum mechanics, known as Hofstadter's butterfly. There are two main subjects in this thesis: the first is the study of the deep connection between the Hofstadter model and the distribution of the algebraic area enclosed by two-dimensional random walks; the second focuses on the distinctive features of Hofstadter's butterfly and the study of the bandwidth of the spectrum. We find an exact expression for the trace of the Hofstadter Hamiltonian in terms of the Kreft coefficients, and for the higher moments of the bandwidth. This thesis is organized as follows. In chapter 1, we begin with the motivation of our work, and a general introduction to the Hofstadter model as well as to random walks is presented. In chapter 2, we show how to use the connection between random walks and the Hofstadter model; a method to calculate the generating function of the algebraic area distribution enclosed by planar random walks is explained in detail. In chapter 3, we present another method to study these issues, using the point spectrum traces to recover the full Hofstadter trace; moreover, the advantage of this construction is that it can be generalized to the almost Mathieu operator. In chapter 4, we introduce the method initially developed by D. J. Thouless to calculate the bandwidth of the Hofstadter spectrum and, following the same logic, we show how to generalize the Thouless bandwidth formula to its n-th moment, to be defined more precisely later.
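
For background (a standard result, not a formula from the thesis), in the Landau gauge the Hofstadter model reduces to the Harper equation, a one-dimensional eigenvalue problem parametrized by the flux per plaquette alpha = p/q and a momentum k:

```latex
% Harper equation: eigenvalue problem for the Hofstadter model on a square
% lattice with rational flux \alpha = p/q per plaquette (Landau gauge).
\psi_{n+1} + \psi_{n-1} + 2\cos\!\left(2\pi \alpha n + k\right)\psi_n = E\,\psi_n
```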
13

Maginnity, Joseph D. "Comparing the Uses and Classification Accuracy of Logistic and Random Forest Models on an Adolescent Tobacco Use Dataset." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1586997693789325.

Full text of the source
14

Poitevin, Caroline Myriam. "Non-random inter-specific encounters between Amazon understory forest birds : what are they and how do they change." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2016. http://hdl.handle.net/10183/150626.

Full text of the source
Abstract:
Inter-specific associations of birds are complex social phenomena, frequently detected and often stable over time and space. So far, the social structure of these associations has largely been deduced from subjective assessments in the field or by counting the number of inter-specific encounters at the whole-group level, without considering changes to individual pairwise interactions. Here, we look for evidence of non-random association between pairs of bird species, delimit groups of more strongly associated species and examine differences in social structure between old-growth and secondary forest habitat. We used records of bird species detection from mist-netting captures and from acoustic recordings to identify pairwise associations detected more frequently than expected under a null distribution, and compared the strength of these associations between old-growth and secondary Amazonian tropical forest. We also used the pairwise association strengths to visualize the social network structure and its changes between habitat types. We found many strongly positive interactions between species, but no evidence of repulsion. Network analyses revealed several modules of species that broadly agree with the subjective groupings described in the ornithological literature. Furthermore, both network structure and association strength changed drastically with habitat disturbance, with the formation of a few new associations but a general trend towards the breaking of associations between species. Our results show that social grouping in birds is real and may be strongly affected by habitat degradation, suggesting that the stability of the associations is threatened by anthropogenic disturbance.
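
The null-distribution test for a species pair can be sketched with a simple permutation scheme; the detection matrix below is fabricated, and shuffling one species' detections across sessions is only one of several possible null models.

```python
import numpy as np

# Detection matrix: rows = sampling sessions, columns = species (fabricated).
rng = np.random.default_rng(0)
det = rng.random((200, 2)) < 0.3          # detections of species A and B
det[:100, 1] |= det[:100, 0]              # force co-occurrence in half the sessions

observed = np.sum(det[:, 0] & det[:, 1])  # observed co-detections

# Null distribution: shuffle one species' detections across sessions, which
# preserves each species' prevalence but breaks any pairwise association.
null = np.array([np.sum(det[:, 0] & rng.permutation(det[:, 1]))
                 for _ in range(10000)])
p_value = np.mean(null >= observed)       # one-sided test for attraction
print("observed:", observed, "null mean:", null.mean().round(1), "p:", p_value)
```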
15

Ospina, Arango Juan David. "Predictive models for side effects following radiotherapy for prostate cancer." Thesis, Rennes 1, 2014. http://www.theses.fr/2014REN1S046/document.

Full text of the source
Abstract:
External beam radiotherapy (EBRT) is one of the cornerstones of prostate cancer treatment. The objectives of radiotherapy are, firstly, to deliver a high dose of radiation to the tumor (prostate and seminal vesicles) in order to achieve maximal local control and, secondly, to spare the neighboring organs at risk (mainly the rectum and the bladder) in order to avoid normal tissue complications. Normal tissue complication probability (NTCP) models are therefore needed to assess the feasibility of the treatment and inform the patient about the risk of side effects, to derive dose-volume constraints and to compare different treatments. In the context of EBRT, the objectives of this thesis were to find predictors of bladder and rectal complications following treatment; to develop new NTCP models that allow for the integration of both dosimetric and patient parameters; to compare the predictive capabilities of these new models to the classic NTCP models; and to develop new methodologies to identify dose patterns correlated with normal tissue complications following EBRT for prostate cancer. A large cohort of patients treated by conformal EBRT for prostate cancer in several prospective French clinical trials was used for the study. In a first step, the incidence of the main genitourinary and gastrointestinal symptoms was described. With another classical approach, namely logistic regression, predictors of genitourinary and gastrointestinal complications were identified. The logistic regression models were then represented graphically as nomograms, a tool that enables clinicians to rapidly assess the complication risks associated with a treatment and to inform patients; this information can be used by patients and clinicians to select a treatment among several options (e.g. EBRT or radical prostatectomy). In a second step, we proposed the use of random forests, a machine-learning technique, to predict the risk of complications following EBRT for prostate cancer. The superiority of the random forest NTCP over the Lyman-Kutcher-Burman (LKB) and logistic NTCP models, assessed by the area under the receiver operating characteristic (ROC) curve (AUC), was established. In a third step, the 3D dose distribution was studied. A 2D population value decomposition (PVD) technique was extended to a tensorial framework for application to 3D image analysis. Using this tensorial PVD, a population analysis was carried out to find a dose pattern possibly correlated with normal tissue complications following EBRT. Also in the context of 3D image population analysis, a spatio-temporal nonparametric mixed-effects model was developed and applied to find an anatomical region where the dose could be correlated with the occurrence of side effects.
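
For reference, the classic Lyman-Kutcher-Burman NTCP model used as a comparison baseline can be written in a few lines; the DVH and the parameter values below are placeholders for illustration, not fitted values from the thesis.

```python
import numpy as np
from scipy.stats import norm

def lkb_ntcp(doses, volumes, n, m, td50):
    """Lyman-Kutcher-Burman NTCP from a differential DVH.

    doses, volumes : DVH dose bins (Gy) and fractional volumes (sum to 1)
    n              : volume-effect parameter (n -> 0: serial organ)
    m              : slope parameter
    td50           : dose giving 50% complication probability (Gy)
    """
    geud = np.sum(volumes * doses ** (1.0 / n)) ** n      # generalized EUD
    return norm.cdf((geud - td50) / (m * td50))           # NTCP = Phi(t)

# Placeholder rectum-like DVH and parameter values, for illustration only.
doses = np.linspace(5, 75, 15)
volumes = np.full(15, 1 / 15)
print(f"NTCP = {lkb_ntcp(doses, volumes, n=0.09, m=0.13, td50=76.9):.2%}")
```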
16

Ichard, Cécile. "Random media and processes estimation using non-linear filtering techniques : application to ensemble weather forecast and aircraft trajectories." Thesis, Toulouse 3, 2015. http://www.theses.fr/2015TOU30153/document.

Full text of the source
Abstract:
Aircraft trajectory prediction error can be explained by different factors, one of which is weather forecast uncertainty. For example, wind forecast error has a non-negligible impact on the along-track accuracy of the predicted aircraft position. From a different perspective, this means that aircraft can be used as local sensors to estimate the wind forecast error. In this work we describe the estimation problem as several acquisition processes of the same random field. When the field is homogeneous, we prove that they are equivalent to random processes evolving in a random medium, for which a Feynman-Kac formulation is given. We then derive a particle-based approximation and provide convergence results for the ensuing estimators. When the random field is not homogeneous but can be decomposed into homogeneous sub-domains, a different model is proposed, based on the coupling of several acquisition processes; from there, a Feynman-Kac formulation is derived and its particle-based approximation of the measure flow is suggested. Furthermore, in order to handle air traffic, we develop an aircraft trajectory prediction model. Finally, we demonstrate on a simulation set-up that our algorithms can estimate wind forecast errors using the observations delivered by aircraft along their trajectories.
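
In the standard notation (a general reminder, not the thesis's specific construction), a Feynman-Kac flow associates to a Markov chain X_n and potential functions G_p the unnormalized and normalized measures below; the latter is what an N-particle system with particles ξ_n^i approximates:

```latex
% Feynman-Kac measures and their N-particle approximation.
\gamma_n(f) = \mathbb{E}\!\left[ f(X_n) \prod_{p=0}^{n-1} G_p(X_p) \right],
\qquad
\eta_n(f) = \frac{\gamma_n(f)}{\gamma_n(1)}
\approx \frac{1}{N}\sum_{i=1}^{N} f\big(\xi_n^{i}\big)
```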
17

Rusu, Corneliu. "Risk Factors for Suicidal Behaviour Among Canadian Civilians and Military Personnel: A Recursive Partitioning Approach." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37371.

Full text of the source
Abstract:
Background: Suicidal behaviour is a major public health problem that has not abated over the past decade. Adopting machine learning algorithms that allow risk factors to be combined in ways that may increase the predictive accuracy of models of suicidal behaviour is one promising avenue toward effective prevention and treatment. Methods: We used the Canadian Community Health Survey – Mental Health and the Canadian Forces Mental Health Survey to build conditional inference random forest models of suicidal behaviour in the Canadian general population and the Canadian Armed Forces. We generated risk algorithms for suicidal behaviour in each sample, performed within- and between-sample validation and reported the corresponding performance metrics. Results: Only a handful of variables were important in predicting suicidal behaviour in the Canadian general population and the Canadian Armed Forces. Each model's performance on within-sample validation was satisfactory, with moderate to high sensitivity and high specificity, while the performance on between-sample validation was conditional on the size and heterogeneity of the training sample. Conclusion: Using conditional inference random forest methodology on large nationally representative mental health surveys has the potential of generating models of suicidal behaviour that not only reflect its complex nature, but also indicate that the true positive cases are likely to be captured by this approach.
18

Abud, Luciana de Melo e. "Modelos computacionais prognósticos de lesões traumáticas do plexo braquial em adultos." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-20082018-140641/.

Full text of the source
Abstract:
Studies of prognosis refer to the prediction of the course of a disease in patients and are employed by health professionals in order to improve patients' recovery chances and quality. From a computational perspective, the creation of a prognostic model is a classification task that aims to identify to which class (within a predefined set of classes) a new sample belongs. The goal of this project is the creation of prognostic models for traumatic injuries of the brachial plexus, the network of nerves that innervates the upper limbs, using data from adult patients with this kind of injury. The data come from the Neurology Institute Deolindo Couto (INDC) of Rio de Janeiro Federal University (UFRJ) and comprise dozens of clinical features collected by means of electronic questionnaires. With these prognostic models we intend to automatically identify possible predictors of the course of brachial plexus injuries. Decision trees are classifiers frequently used for the creation of prognostic models, since they are a transparent technique whose results can be clinically examined and interpreted. Random forests, a technique that uses a set of decision trees to determine the final classification result, can significantly improve a model's accuracy and generalization, yet they are still not commonly used for the creation of prognostic models. In this project we explored the use of random forests for that purpose, as well as interpretation methods for the resulting models, since model transparency is a particularly important aspect in clinical domains. Model assessment relied on methods suitable for a small number of instances, since the available prognostic data refer to only 44 patients from the INDC. Additionally, we adapted the random forest technique to handle missing values, which are frequent in the data used in this project. Four prognostic models were created, one for each recovery goal: absence of pain, and satisfactory strength evaluated over shoulder abduction, elbow flexion and external shoulder rotation. The models' accuracies were estimated between 77% and 88% using the leave-one-out cross-validation method. These models will evolve with the inclusion of new data from new patients arriving for treatment at the INDC, and they will be used as part of a clinical decision support system, making it possible to predict a patient's recovery from his or her clinical characteristics.
19

Lundström, Love, and Oscar Öhman. "Machine Learning in credit risk : Evaluation of supervised machine learning models predicting credit risk in the financial sector." Thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-164101.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
When banks lend money to another party they face a risk that the borrower will not fulfil its obligation towards the bank. This risk is called credit risk and it is the largest risk banks face. According to the Basel Accords, banks need to hold a certain amount of capital to protect themselves against future financial crises. This amount is calculated for each loan with an attached risk-weighted asset, RWA. The main parameters in RWA are the probability of default and the loss given default. Banks are today allowed to use their own internal models to calculate these parameters. Since holding capital that earns no interest is a great cost, banks seek tools to better predict the probability of default and thereby lower the capital requirement. Machine learning and supervised algorithms such as logistic regression, neural networks, decision trees and random forests can be used to assess credit risk. By training algorithms on historical data with known outcomes, the probability of default (PD) can be determined with a higher degree of certainty than with traditional models, leading to a lower capital requirement. On the data set used in this thesis, logistic regression appears to be the algorithm with the highest accuracy in classifying customers into the right category. However, it classifies many customers as false positives, meaning the model believes a customer will honour its obligation when in fact the customer defaults, which carries a great cost for the banks. By implementing a cost function to minimize this error, we found that the neural network has the lowest false positive rate and is therefore the model best suited for this specific classification task.
När banker lånar ut pengar till en annan part uppstår en risk i att låntagaren inte uppfyller sitt antagande mot banken. Denna risk kallas för kredit risk och är den största risken en bank står inför. Enligt Basel föreskrifterna måste en bank avsätta en viss summa kapital för varje lån de ger ut för att på så sätt skydda sig emot framtida finansiella kriser. Denna summa beräknas fram utifrån varje enskilt lån med tillhörande risk-vikt, RWA. De huvudsakliga parametrarna i RWA är sannolikheten att en kund ej kan betala tillbaka lånet samt summan som banken då förlorar. Idag kan banker använda sig av interna modeller för att estimera dessa parametrar. Då bundet kapital medför stora kostnader för banker, försöker de sträva efter att hitta bättre verktyg för att uppskatta sannolikheten att en kund fallerar för att på så sätt minska deras kapitalkrav. Därför har nu banker börjat titta på möjligheten att använda sig av maskininlärningsalgoritmer för att estimera dessa parametrar. Maskininlärningsalgoritmer såsom Logistisk regression, Neurala nätverk, Beslutsträd och Random forest, kan användas för att bestämma kreditrisk. Genom att träna algoritmer på historisk data med kända resultat kan parametern, chansen att en kund ej betalar tillbaka lånet (PD), bestämmas med en högre säkerhet än traditionella metoder. På den givna datan som denna uppsats bygger på visar det sig att Logistisk regression är den algoritm med högst träffsäkerhet att klassificera en kund till rätt kategori. Däremot klassifiserar denna algoritm många kunder som falsk positiv vilket betyder att den predikterar att många kunder kommer betala tillbaka sina lån men i själva verket inte betalar tillbaka lånet. Att göra detta medför en stor kostnad för bankerna. Genom att istället utvärdera modellerna med hjälp av att införa en kostnadsfunktion för att minska detta fel finner vi att Neurala nätverk har den lägsta falsk positiv ration och kommer därmed vara den model som är bäst lämpad att utföra just denna specifika klassifierings uppgift.
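The cost-function evaluation described above can be sketched in a few lines of Python. The cost weights, probabilities, and labels below are invented for illustration; in the abstract's terminology, a "false positive" is a defaulter predicted to repay.

    # Sketch of cost-weighted threshold selection: a false positive
    # (predicting "will repay" for a defaulter) is far more expensive
    # than refusing a good customer.
    import numpy as np

    def total_cost(y_true, p_default, threshold, c_fp=10.0, c_fn=1.0):
        pred_default = p_default >= threshold
        fp = np.sum(~pred_default & (y_true == 1))  # missed defaulters
        fn = np.sum(pred_default & (y_true == 0))   # good customers refused
        return c_fp * fp + c_fn * fn

    y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])     # 1 = default
    p_default = np.array([.1, .2, .35, .4, .45, .6, .3, .8])
    thresholds = np.linspace(0.05, 0.95, 19)
    best = min(thresholds, key=lambda t: total_cost(y_true, p_default, t))
    print("cost-minimising threshold:", best)

Under such an asymmetric cost, a model with a slightly lower raw accuracy but a lower false positive rate (the neural network, in the thesis's finding) can be the cheaper choice overall.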
20

Svensson, William. "CAN STATISTICAL MODELS BEAT BENCHMARK PREDICTIONS BASED ON RANKINGS IN TENNIS?" Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-447384.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
The aim of this thesis is to beat a benchmark prediction accuracy of 64.58 percent based on player rankings on the ATP tour in tennis, where the better-ranked player in a match is deemed the winner. Three statistical models are used: logistic regression, random forest and XGBoost. The data cover the years 2000-2010 and comprise over 60,000 observations with 49 variables each. After the data were prepared, new variables were created and the differences between the two players in each match were computed; all three statistical models outperformed the benchmark prediction. All three models had an accuracy around 66 percent, with logistic regression performing best at 66.45 percent. The most important variables overall for the models were the total win rate on different surfaces, the total win rate, and rank.
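A toy version of this comparison, on synthetic match data with invented effect sizes, could look like the following; the benchmark simply predicts that the better-ranked player wins.

    # Sketch: rank-based benchmark vs. a logistic regression on
    # feature differences between the two players (synthetic data).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    n = 5000
    rank_diff = rng.normal(size=n)        # rank(player1) - rank(player2)
    surface_wr_diff = rng.normal(size=n)  # win-rate difference on surface
    logit = -1.2 * rank_diff + 0.8 * surface_wr_diff
    y = rng.random(n) < 1 / (1 + np.exp(-logit))   # True = player1 wins

    X = np.column_stack([rank_diff, surface_wr_diff])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    benchmark = np.mean((X_te[:, 0] < 0) == y_te)  # better rank wins
    model_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"benchmark {benchmark:.3f}  vs  logistic {model_acc:.3f}")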
21

Barth, Danielle. "To HAVE and to BE: Function Word Reduction in Child Speech, Child Directed Speech and Inter-adult Speech." Thesis, University of Oregon, 2016. http://hdl.handle.net/1794/19687.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Function words are known to be shorter than content words. I investigate the function words BE and HAVE (with its content word homonym) and show that more reduction, operationalized as word shortening or contraction, is found in some grammaticalized meanings of these words. The difference between the words' uses cannot be attributed to differences in frequency or semantic weight. Instead I argue that these words are often shortened and reduced when they occur in constructions in which they are highly predictable. This suggests that particular grammaticalized uses of a word are stored with their own exemplar clouds of context-specific phonetic realizations. The phonetics of any instance of a word are then jointly determined by the exemplar cloud for that word and the particular context. A given instance of an auxiliary can be reduced either because it is predictable in the current context or because that use of the auxiliary usually occurs in predictable contexts. The present study compares function word production in the speech of school-aged children and their caregivers and in inter-adult speech. The effects of predictability in context and average predictability across contexts are replicated across the datasets. However, I find that as children get older their function words shorten relative to content words, even when controlling for increasing speech rate, showing that as their language experience increases they spend less time where it is not needed for comprehensibility. Caregivers spend less time on function words with older children than younger children, suggesting that they expect function words to be more difficult for younger interlocutors to decode than for older interlocutors. Additionally, while adults use either word shortening or contraction to increase the efficiency of speech, children tend to use either both contraction and word shortening or neither until age seven, when they start to use one strategy or the other like adults. Young children with better vocabulary employ an adult-like strategy earlier, suggesting earlier onset of efficient yet effective speech behavior, namely allocating less signal to function words when they are especially easy for the listener to decode.
22

Liu, Xiaoyang. "Machine Learning Models in Fullerene/Metallofullerene Chromatography Studies." Thesis, Virginia Tech, 2019. http://hdl.handle.net/10919/93737.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Machine learning methods are now extensively applied in various scientific research areas to build models. Unlike conventional models, machine learning based models follow a data-driven approach: the algorithms can learn, from available data, knowledge that is hard to recognize otherwise. Data-driven approaches enhance the role of algorithms and computers and can accelerate computation by offering alternative views of a problem. In this thesis, we explore the possibility of applying machine learning models to the prediction of chromatographic retention behaviors. Chromatographic separation is a key technique for the discovery and analysis of fullerenes. In previous studies, differential equation models have achieved great success in predicting chromatographic retention. However, most differential equation models require experimental measurements or theoretical computations for many parameters, which are not easy to obtain. Fullerenes/metallofullerenes are rigid, spherical molecules built from carbon atoms, which makes predicting their chromatographic retention behaviors, as well as other properties, much simpler than for flexible molecules with more conformational variation. In this thesis, I propose that the polarizability of a fullerene molecule can be estimated directly from its structure. Structural motifs are used to simplify the model, and the models with motifs provide satisfying predictions. The data set contains 31947 isomers and their polarizability data and is split into a training set with 90% of the data points and a complementary testing set. In addition, a second testing set of large fullerene isomers is prepared and used to test whether a model trained on small fullerenes can give good predictions for large fullerenes.
Machine learning models can be applied in a wide range of areas, including scientific research. In this thesis, machine learning models are applied to predict the chromatography behaviors of fullerenes based on their molecular structures. Chromatography is a common technique for separating mixtures; the separation arises because different molecules interact differently with a stationary phase. In real experiments, a mixture usually contains a large family of different compounds, and identifying the target compound requires a lot of work and resources. Therefore, models are extremely important for studies of chromatography. Traditional models are built on physical principles and involve several parameters, which are measured experimentally or computed theoretically; both routes are time-consuming and difficult to carry out. For fullerenes, my previous studies have shown that the chromatography model can be simplified so that only one parameter, polarizability, is required. A machine learning approach is introduced to enhance the model by predicting the molecular polarizabilities of fullerenes from their structures, with the structure of a fullerene represented by several local structures. Several types of machine learning models are built and tested on our data set, and the results show that the neural network gives the best predictions.
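A stripped-down sketch of the regression setup follows: synthetic motif counts stand in for the thesis's local-structure features, a random forest stands in for whichever regressor is chosen, and the thesis's 90/10 split is reproduced. All numbers here are invented.

    # Sketch: motif-count features -> toy "polarizability" target,
    # 90/10 train/test split as described in the abstract.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    n_isomers, n_motifs = 2000, 12
    X = rng.poisson(3.0, size=(n_isomers, n_motifs))    # motif counts
    w = rng.normal(size=n_motifs)
    y = X @ w + rng.normal(scale=0.5, size=n_isomers)   # toy target

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                              random_state=0)
    reg = RandomForestRegressor(n_estimators=300, random_state=0)
    reg.fit(X_tr, y_tr)
    print("R^2 on held-out isomers:", round(reg.score(X_te, y_te), 3))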
23

Galleguillos, Aguilar Matías. "Desarrollo de un modelo predictivo de deserción de estudiantes de primer año en institución de educación superior." Tesis, Universidad de Chile, 2018. http://repositorio.uchile.cl/handle/2250/170006.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Thesis submitted to qualify for the professional degree of Ingeniero Civil Eléctrico (Electrical Civil Engineer)
In Chile, over the last 30 years there has been significant growth in access to higher education. This growth has been accompanied by an increase in university dropout, which is particularly high among first-year students. The problem imposes large costs of various kinds on both students and universities, and dropout has become one of the most important metrics used to accredit institutions. Universidad de las Américas has faced a high dropout rate, which in 2013 contributed significantly to the loss of its accreditation, making dropout a priority problem to solve; a plan was therefore devised to help the students most likely to drop out. UDLA currently has no automated system that classifies students based on analysis of behavioural data; it only has a rule-based system built from university staff's knowledge of dropout, which has a high error rate. In the latest study on first-year retention published by the Higher Education Information Service, built with data on students who enrolled in 2016, Universidad de las Américas ranks 47th out of 58 universities. Developing a system able to identify students at risk of dropping out therefore remains a priority for the institution. The goal of this work is to develop a system able to deliver a dropout-risk index for each first-year student. To this end, risk assignment is framed as a classification problem and tackled with computational intelligence tools. The semester was divided into segments and a model was trained for each one. The accuracy of the first model, 70.1%, was lower than that of similar studies addressing the same problem at other universities around the world. Each segment's model gave better results than the previous one, with the end-of-semester model performing best at 82.5% accuracy, comparable to other work.
24

Sun, Wangru. "Modèle de forêts enracinées sur des cycles et modèle de perles via les dimères." Thesis, Sorbonne université, 2018. http://www.theses.fr/2018SORUS007/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Le modèle de dimères, également connu sous le nom de modèle de couplage parfait, est un modèle probabiliste introduit à l'origine dans la mécanique statistique. Une configuration de dimères d'un graphe est un sous-ensemble des arêtes tel que chaque sommet est incident à exactement une arête. Un poids est attribué à chaque arête et la probabilité d'une configuration est proportionnelle au produit des poids des arêtes présentes. Dans cette thèse, nous étudions principalement deux modèles qui sont liés au modèle de dimères, et plus particulièrement leur comportements limites. Le premier est le modèle des forêts couvrantes enracinées sur des cycles (CRSF) sur le tore, qui sont en bijection avec les configurations de dimères via la bijection de Temperley. Dans la limite quand la taille du tore tend vers l'infini, la mesure sur les CRSF converge vers une mesure de Gibbs ergodique sur le plan tout entier. Nous étudions la connectivité de l'objet limite, prouvons qu'elle est déterminée par le changement de hauteur moyen de la mesure de Gibbs ergodique et donnons un diagramme de phase. Le second est le modèle de perles, un processus ponctuel sur $\mathbb{Z}\times\mathbb{R}$ qui peut être considéré comme une limite à l'échelle du modèle de dimères sur un réseau hexagonal. Nous formulons et prouvons un principe variationnel similaire à celui du modèle dimère \cite{CKP01}, qui indique qu'à la limite de l'échelle, la fonction de hauteur normalisée d'une configuration de perles converge en probabilité vers une surface $h_0$ qui maximise une certaine fonctionnelle qui s'appelle "entropie". Nous prouvons également que la forme limite $h_0$ est une limite de l'échelle des formes limites de modèles de dimères. Il existe une correspondance entre configurations de perles et (skew) tableaux de Young standard, qui préserve la mesure uniforme sur les deux ensembles. Le principe variationnel du modèle de perles implique une forme limite d'un tableau de Young standard aléatoire. Ce résultat généralise celui de \cite{PR}. Nous dérivons également l'existence d'une courbe arctique d'un processus ponctuel discret qui encode les tableaux standard, defini dans \cite{Rom}
The dimer model, also known as the perfect matching model, is a probabilistic model originally introduced in statistical mechanics. A dimer configuration of a graph is a subset of the edges such that every vertex is incident to exactly one edge of the subset. A weight is assigned to every edge, and the probability of a configuration is proportional to the product of the weights of the edges present. In this thesis we mainly study two related models and in particular their limiting behavior. The first one is the model of cycle-rooted-spanning-forests (CRSF) on tori, which is in bijection with toroidal dimer configurations via Temperley's bijection. This gives rise to a measure on CRSF. In the limit where the size of the torus tends to infinity, the CRSF measure tends to an ergodic Gibbs measure on the whole plane. We study the connectivity property of the limiting object, prove that it is determined by the average height change of the limiting ergodic Gibbs measure and give a phase diagram. The second one is the bead model, a random point field on $\mathbb{Z}\times\mathbb{R}$ which can be viewed as a scaling limit of the dimer model on a hexagonal lattice. We formulate and prove a variational principle similar to that of the dimer model \cite{CKP01}, which states that in the scaling limit, the normalized height function of a uniformly chosen random bead configuration lies in an arbitrarily small neighborhood of a surface $h_0$ that maximizes some functional which we call the entropy. We also prove that the limit shape $h_0$ is a scaling limit of the limit shapes of a properly chosen sequence of dimer models. There is a map from bead configurations to standard tableaux of a (skew) Young diagram, and the map is measure-preserving if both sides are given uniform measures. The variational principle of the bead model yields the existence of the limit shape of a random standard Young tableau, which generalizes the result of \cite{PR}. We also derive the existence of an arctic curve of a discrete point process that encodes the standard tableaux, introduced in \cite{Rom}
25

Dinger, Steven. "Essays on Reinforcement Learning with Decision Trees and Accelerated Boosting of Partially Linear Additive Models." University of Cincinnati / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1562923541849035.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
26

Zhang, Qing Frankowski Ralph. "An empirical evaluation of the random forests classifier models for variable selection in a large-scale lung cancer case-control study /." See options below, 2006. http://proquest.umi.com/pqdweb?did=1324365481&sid=1&Fmt=2&clientId=68716&RQT=309&VName=PQD.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
27

Palczewska, Anna Maria. "Interpretation, Identification and Reuse of Models. Theory and algorithms with applications in predictive toxicology." Thesis, University of Bradford, 2014. http://hdl.handle.net/10454/7349.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
This thesis is concerned with developing methodologies that enable existing models to be effectively reused. Results of this thesis are presented in the framework of Quantitative Structure-Activity Relationship (QSAR) models, but their application is much more general. QSAR models relate chemical structures with their biological, chemical or environmental activity. There are many applications that offer an environment to build and store predictive models. Unfortunately, they do not provide advanced functionalities that allow for efficient model selection and for interpretation of model predictions for new data. This thesis aims to address these issues and proposes methodologies for dealing with three research problems: model governance (management), model identification (selection), and interpretation of model predictions. The combination of these methodologies can be employed to build more efficient systems for model reuse in QSAR modelling and other areas. The first part of this study investigates toxicity data and model formats and reviews some of the existing toxicity systems in the context of model development and reuse. Based on the findings of this review and the principles of data governance, a novel concept of model governance is defined. Model governance comprises model representation and model governance processes. These processes are designed and presented in the context of model management. As an application, minimum information requirements and an XML representation for QSAR models are proposed. Once a collection of validated, accepted and well annotated models is available within a model governance framework, they can be applied to new data. It may happen that there is more than one model available for the same endpoint. Which one to choose? The second part of this thesis proposes a theoretical framework and algorithms that enable automated identification of the most reliable model for new data from the collection of existing models. The main idea is based on partitioning the search space into groups and assigning a single model to each group. The construction of this partitioning is difficult because it is a bi-criteria problem. The main contribution in this part is the application of Pareto points for the search space partition. The proposed methodology is applied to three endpoints in chemoinformatics and predictive toxicology. After having identified a model for the new data, we would like to know how the model obtained its prediction and how trustworthy it is. An interpretation of model predictions is straightforward for linear models thanks to the availability of model parameters and their statistical significance. For non-linear models this information can be hidden inside the model structure. This thesis proposes an approach for the interpretation of a random forest classification model. This approach allows for the determination of the influence (called feature contribution) of each variable on the model prediction for an individual data point. In this part, three methods are proposed that allow analysis of feature contributions. Such analysis might lead to the discovery of new patterns that represent a standard behaviour of the model and allow additional assessment of the model reliability for new data. The application of these methods to two standard benchmark datasets from the UCI machine learning repository shows the great potential of this methodology.
The algorithm for calculating feature contributions has been implemented and is available as an R package called rfFC.
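The thesis ships the feature-contribution algorithm as the R package rfFC. The following is an independent, simplified Python re-implementation sketch of the same idea for a binary random forest classifier: walk each tree's decision path and attribute changes in the predicted class-1 probability to the splitting features. It relies on scikit-learn's tree internals and is illustrative only.

    # Minimal sketch of per-instance feature contributions in a forest.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    def tree_contributions(tree, x, n_features):
        t = tree.tree_
        contrib = np.zeros(n_features)
        node = 0
        prob = t.value[node][0] / t.value[node][0].sum()
        while t.children_left[node] != -1:          # until a leaf
            f = t.feature[node]
            child = (t.children_left[node] if x[f] <= t.threshold[node]
                     else t.children_right[node])
            child_prob = t.value[child][0] / t.value[child][0].sum()
            contrib[f] += child_prob[1] - prob[1]   # change in P(class 1)
            node, prob = child, child_prob
        return contrib

    X, y = load_breast_cancer(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    contribs = np.mean([tree_contributions(e, X[0], X.shape[1])
                        for e in rf.estimators_], axis=0)
    print("top contributing feature index:", int(np.abs(contribs).argmax()))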
28

Lanka, Venkata Raghava Ravi Teja Lanka. "VEHICLE RESPONSE PREDICTION USING PHYSICAL AND MACHINE LEARNING MODELS." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1511891682062084.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
29

Appelquist, Niklas, and Emelia Karlsson. "Kan en bättre prediktion uppnås genom en kategorispecifik modell? : Teknologiprojekt på Kickstarter och maskininlärning." Thesis, Uppsala universitet, Institutionen för informatik och media, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-413736.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Crowdfunding används för att samla in pengar för tänkta projekt via internet, där ett stort antal investerare bidrar med små summor. Kickstarter är en av de största crowdfundingplattformarna idag. Trots det stora intresset för crowdfunding misslyckas många kampanjer att nå sin målsumma och projekt av kategorin teknologi visar sig vara de projekt som misslyckas till högst grad. Därmed är det av intresse att kunna förutsäga vilka kampanjer som kommer att lyckas eller misslyckas. Denna forskningsansats syftar till att undersöka genomförbarheten i att uppnå en högre accuracy vid prediktion av framgången hos lanserade kickstarterprojekt med hjälp av maskininlärning genom att använda en mindre mängd kategorispecifik data. Data över 192 548 lanserade projekt på plattformen Kickstarter har samlats in via www.kaggle.com. Två modeller av typen RandomForest har sedan tränats där en modell tränades med data över samtliga projekt i uppsättningen och en tränades med data över teknologiprojekt med syftet att kunna jämföra modellernas prestation vid klassificering av teknologiprojekt. Resultatet visar att en högre accuracy uppmättes för teknologimodellen som nådde 68,37% träffsäkerhet vid klassificeringen gentemot referensmodellens uppvisade accuracy på 68,00%.
Crowdfunding is used to collect money via the internet for potential projects through a large number of backers who each contribute a small pledge. Kickstarter is one of the largest crowdfunding platforms today. Despite the great interest in crowdfunding, many launched campaigns fail to reach their goal, and projects in the technology category show the highest rate of failure on Kickstarter. It is therefore valuable to be able to predict which campaigns are likely to succeed or fail. This thesis explores whether a higher accuracy can be reached when predicting the success of launched projects with machine learning trained on a smaller amount of category-specific data. The data consist of 192,548 launched projects on Kickstarter and were collected through Kaggle.com. Two Random Forest models were developed: one trained on general data over all projects, and one trained on category-specific data over technology projects. The results show that the technology model achieves a higher accuracy, 68.37%, compared to the reference model's 68.00%.
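A sketch of the two-model comparison, with hypothetical file and column names (kickstarter.csv, goal, duration_days, n_rewards, category, success): train one forest on all projects and one on technology projects only, then evaluate both on the same held-out technology projects.

    # Sketch: general vs. category-specific Random Forest.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("kickstarter.csv")          # hypothetical file/columns
    features = ["goal", "duration_days", "n_rewards"]
    tech = df[df["category"] == "technology"]

    tech_tr, tech_te = train_test_split(tech, test_size=0.2, random_state=0)
    general_tr = df.drop(tech_te.index)          # all projects minus test set

    m_general = RandomForestClassifier(random_state=0).fit(
        general_tr[features], general_tr["success"])
    m_tech = RandomForestClassifier(random_state=0).fit(
        tech_tr[features], tech_tr["success"])
    for name, m in [("general", m_general), ("technology-only", m_tech)]:
        print(name, m.score(tech_te[features], tech_te["success"]))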
30

Raynal, Louis. "Bayesian statistical inference for intractable likelihood models." Thesis, Montpellier, 2019. http://www.theses.fr/2019MONTS035/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Dans un processus d’inférence statistique, lorsque le calcul de la fonction de vraisemblance associée aux données observées n’est pas possible, il est nécessaire de recourir à des approximations. C’est un cas que l’on rencontre très fréquemment dans certains champs d’application, notamment pour des modèles de génétique des populations. Face à cette difficulté, nous nous intéressons aux méthodes de calcul bayésien approché (ABC, Approximate Bayesian Computation) qui se basent uniquement sur la simulation de données, qui sont ensuite résumées et comparées aux données observées. Ces comparaisons nécessitent le choix judicieux d’une distance, d’un seuil de similarité et d’un ensemble de résumés statistiques pertinents et de faible dimension.Dans un contexte d’inférence de paramètres, nous proposons une approche mêlant des simulations ABC et les méthodes d’apprentissage automatique que sont les forêts aléatoires. Nous utilisons diverses stratégies pour approximer des quantités a posteriori d’intérêts sur les paramètres. Notre proposition permet d’éviter les problèmes de réglage liés à l’ABC, tout en fournissant de bons résultats ainsi que des outils d’interprétation pour les praticiens. Nous introduisons de plus des mesures d’erreurs de prédiction a posteriori (c’est-à-dire conditionnellement à la donnée observée d’intérêt) calculées grâce aux forêts. Pour des problèmes de choix de modèles, nous présentons une stratégie basée sur des groupements de modèles qui permet, en génétique des populations, de déterminer dans un scénario évolutif les évènements plus ou moins bien identifiés le constituant. Toutes ces approches sont implémentées dans la bibliothèque R abcrf. Par ailleurs, nous explorons des manières de construire des forêts aléatoires dites locales, qui prennent en compte l’observation à prédire lors de leur phase d’entraînement pour fournir une meilleure prédiction. Enfin, nous présentons deux études de cas ayant bénéficié de nos développements, portant sur la reconstruction de l’histoire évolutive de population pygmées, ainsi que de deux sous-espèces du criquet pèlerin Schistocerca gregaria
In a statistical inferential process, when the calculation of the likelihood function is not possible, approximations need to be used. This is a fairly common case in some application fields, especially for population genetics models. To address this issue, we are interested in approximate Bayesian computation (ABC) methods. These are solely based on simulated data, which are then summarised and compared to the observed ones. The comparisons depend on a distance, a similarity threshold and a set of low-dimensional summary statistics, which must be carefully chosen. In a parameter inference framework, we propose an approach combining ABC simulations and the random forest machine learning algorithm. We use different strategies depending on the posterior quantity of the parameter we would like to approximate. Our proposal avoids the usual ABC difficulties in terms of tuning, while providing good results and interpretation tools for practitioners. In addition, we introduce posterior measures of error (i.e., conditionally on the observed data of interest) computed by means of forests. In a model choice setting, we present a strategy based on groups of models to determine, in population genetics, which events of an evolutionary scenario are more or less well identified. All these approaches are implemented in the R package abcrf. In addition, we investigate how to build local random forests, taking into account the observation to predict during their learning phase to improve the prediction accuracy. Finally, using our previous developments, we present two case studies dealing with the reconstruction of the evolutionary history of Pygmy populations, as well as of two subspecies of the desert locust Schistocerca gregaria
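A toy Python analogue of the ABC random forest approach (the real implementation is the R package abcrf): draw parameters from the prior, simulate data, reduce each simulated dataset to summary statistics, and train a forest to map summaries to the parameter. The model, prior, and summaries below are invented for illustration.

    # Sketch: ABC-style parameter inference with a random forest.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(3)
    n_sim = 10000
    theta = rng.uniform(0, 5, n_sim)                      # prior draws
    sims = rng.normal(theta[:, None], 1.0, (n_sim, 50))   # simulated data
    summaries = np.column_stack([sims.mean(1), sims.var(1)])

    rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=5,
                               random_state=0).fit(summaries, theta)

    observed = rng.normal(2.5, 1.0, 50)                   # "real" data
    s_obs = np.array([[observed.mean(), observed.var()]])
    print("posterior mean estimate of theta:", rf.predict(s_obs)[0])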
31

Olofsson, Nina. "A Machine Learning Ensemble Approach to Churn Prediction : Developing and Comparing Local Explanation Models on Top of a Black-Box Classifier." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-210565.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Churn prediction methods are widely used in Customer Relationship Management and have proven to be valuable for retaining customers. To obtain a high predictive performance, recent studies rely on increasingly complex machine learning methods, such as ensemble or hybrid models. However, the more complex a model is, the more difficult it becomes to understand how decisions are actually made. Previous studies on machine learning interpretability have used a global perspective for understanding black-box models. This study explores the use of local explanation models for explaining the individual predictions of a Random Forest ensemble model. The churn prediction was studied on the users of Tink – a finance app. This thesis aims to take local explanations one step further by making comparisons between churn indicators of different user groups. Three sets of groups were created based on differences in three user features. The importance scores of all globally found churn indicators were then computed for each group with the help of local explanation models. The results showed that the groups did not have any significant differences regarding the globally most important churn indicators. Instead, differences were found for globally less important churn indicators, concerning the type of information that users stored in the app. In addition to comparing churn indicators between user groups, the result of this study was a well-performing Random Forest ensemble model with the ability of explaining the reason behind churn predictions for individual users. The model proved to be significantly better than a number of simpler models, with an average AUC of 0.93.
Metoder för att prediktera utträde är vanliga inom Customer Relationship Management och har visat sig vara värdefulla när det kommer till att behålla kunder. För att kunna prediktera utträde med så hög säkerhet som möjligt har den senaste forskningen fokuserat på alltmer komplexa maskininlärningsmodeller, såsom ensembler och hybridmodeller. En konsekvens av att ha alltmer komplexa modeller är dock att det blir svårare och svårare att förstå hur en viss modell har kommit fram till ett visst beslut. Tidigare studier inom maskininlärningsinterpretering har haft ett globalt perspektiv för att förklara svårförståeliga modeller. Denna studie utforskar lokala förklaringsmodeller för att förklara individuella beslut av en ensemblemodell känd som 'Random Forest'. Prediktionen av utträde studeras på användarna av Tink – en finansapp. Syftet med denna studie är att ta lokala förklaringsmodeller ett steg längre genom att göra jämförelser av indikatorer för utträde mellan olika användargrupper. Totalt undersöktes tre par av grupper som påvisade skillnader i tre olika variabler. Sedan användes lokala förklaringsmodeller till att beräkna hur viktiga alla globalt funna indikatorer för utträde var för respektive grupp. Resultaten visade att det inte fanns några signifikanta skillnader mellan grupperna gällande huvudindikatorerna för utträde. Istället visade resultaten skillnader i mindre viktiga indikatorer som hade att göra med den typ av information som lagras av användarna i appen. Förutom att undersöka skillnader i indikatorer för utträde resulterade denna studie i en välfungerande modell för att prediktera utträde med förmågan att förklara individuella beslut. Random Forest-modellen visade sig vara signifikant bättre än ett antal enklare modeller, med ett AUC-värde på 0.93.
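A much-simplified sketch of a local explanation model on top of a black-box forest: perturb one user's feature vector and fit a small linear surrogate to the forest's predicted churn probabilities (a LIME-style approximation; the data and perturbation scale are invented).

    # Sketch: local linear surrogate around one instance of a forest.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(4)
    X = rng.normal(size=(2000, 8))
    y = (X[:, 0] + X[:, 3] > 0).astype(int)          # toy churn label
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    x0 = X[0]                                        # the user to explain
    Z = x0 + rng.normal(scale=0.3, size=(500, 8))    # local perturbations
    p = rf.predict_proba(Z)[:, 1]
    surrogate = Ridge().fit(Z - x0, p)
    print("local churn-indicator weights:", np.round(surrogate.coef_, 3))

Averaging such local weights within a user group gives group-level importance scores of the kind the thesis compares across groups.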
32

Duroux, Roxane. "Inférence pour les modèles statistiques mal spécifiés, application à une étude sur les facteurs pronostiques dans le cancer du sein." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066224/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Cette thèse est consacrée à l'inférence de certains modèles statistiques mal spécifiés. Chaque résultat obtenu trouve son application dans une étude sur les facteurs pronostiques dans le cancer du sein, grâce à des données collectées par l'Institut Curie. Dans un premier temps, nous nous intéressons au modèle à risques non proportionnels, et exploitons la connaissance de la survie marginale du temps de décès. Ce modèle autorise la variation dans le temps du coefficient de régression, généralisant ainsi le modèle à hasards proportionnels. Dans un deuxième temps, nous étudions un modèle à hasards non proportionnels ayant un coefficient de régression constant par morceaux. Nous proposons une méthode d'inférence pour un modèle à un unique point de rupture, et une méthode d'estimation pour un modèle à plusieurs points de rupture. Dans un troisième temps, nous étudions l'influence du sous-échantillonnage sur la performance des forêts médianes et essayons de généraliser les résultats obtenus aux forêts aléatoires de survie à travers une application. Enfin, nous présentons un travail indépendant où nous développons une nouvelle méthode de recherche de doses, dans le cadre des essais cliniques de phase I à ordre partiel
The thesis focuses on inference for misspecified statistical models. Every result finds its application in a study of prognostic factors for breast cancer, thanks to data collected by the Institut Curie. We first consider non-proportional hazards models and make use of the marginal survival of the failure time. This model allows a time-varying regression coefficient, and therefore generalizes the proportional hazards model. Second, we study step regression models: we propose an inference method for the changepoint of a two-step regression model, and an estimation method for a multiple-step regression model. Then, we study the influence of the subsampling rate on the performance of median forests and try to extend the results to random survival forests through an application. Finally, we present a new dose-finding method for phase I clinical trials in the case of partial ordering
33

Geylan, Gökçe. "Training Machine Learning-based QSAR models with Conformal Prediction on Experimental Data from DNA-Encoded Chemical Libraries." Thesis, Uppsala universitet, Institutionen för farmaceutisk biovetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-447354.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
DNA-encoded chemical libraries (DELs) allow exhaustive sampling of chemical space, with large-scale data consisting of compounds produced through combinatorial synthesis. This novel technology is used in the early drug discovery stages for robust hit identification and lead optimization. In this project, the aim was to build a machine learning-based QSAR model with conformal prediction for hit identification on the two different target proteins the DEL was assayed on. An initial investigation was conducted on a pilot project with 1000 compounds, and the analyses and conclusions drawn from this part were later applied to a larger dataset with 1.2 million compounds. This classification model was intended to predict compound activity in the DEL as well as in an external dataset, identifying the top hits in order to evaluate the model's performance and applicability. Support Vector Machine (SVM) and Random Forest (RF) models were built on both the pilot and the main datasets with different descriptor sets: Signature Fingerprints, RDKIT and CDK. In addition, an autoencoder was used to supply data-driven descriptors on the pilot data. The Libsvm and Liblinear implementations were explored and compared based on the models' performance. The comparisons considered the key concepts of conformal prediction, such as the trade-off between validity and efficiency, the observed fuzziness, and the calibration across a range of significance levels. The top hits were determined by two sorting methods: credibility, and the p-value difference between the binary classes. It was confirmed that the models assign correct single labels to the true actives over a wide range of significance levels, regardless of the similarity of the test compounds to the training set. Furthermore, an accumulation of these true actives in the models' top-hit selections was observed with the latter sorting method, and additional investigations of similarity and building-block enrichment in the top 50 and 100 compounds were conducted. The Tanimoto similarity demonstrated the model's predictive power in selecting structurally dissimilar compounds, while the building-block enrichment analysis showed the selectivity of the binding pocket, target protein B being the more selective. All of these comparison methods enabled an extensive study of model evaluation and performance. In conclusion, the Liblinear model with Signature Fingerprints was judged to give the best performance on both the pilot and the main datasets, considering model performance and computational requirements. However, prediction on an external set was not successful, due to the low structural diversity of the DEL on which the model was trained.
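A minimal inductive conformal classification sketch in Python (binary case, synthetic data): nonconformity scores on a calibration set yield per-label p-values, and a label enters the prediction set whenever its p-value exceeds the significance level. This illustrates the validity/efficiency machinery only, not the thesis's SVM/RF pipeline.

    # Sketch: inductive conformal prediction on top of a random forest.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)
    X = rng.normal(size=(3000, 20))
    y = (X[:, :3].sum(1) > 0).astype(int)
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)
    X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest,
                                                test_size=0.5, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr, y_tr)
    cal_p = clf.predict_proba(X_cal)
    alpha_cal = 1.0 - cal_p[np.arange(len(y_cal)), y_cal]  # nonconformity

    def prediction_set(x, eps=0.2):
        probs = clf.predict_proba(x.reshape(1, -1))[0]
        labels = []
        for c in (0, 1):
            alpha = 1.0 - probs[c]
            p_value = (np.sum(alpha_cal >= alpha) + 1) / (len(alpha_cal) + 1)
            if p_value > eps:
                labels.append(c)
        return labels

    print("prediction set for first test point:", prediction_set(X_te[0]))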
34

Zhang, Yi. "Strategies for Combining Tree-Based Ensemble Models." NSUWorks, 2017. http://nsuworks.nova.edu/gscis_etd/1021.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Ensemble models have proved effective in a variety of classification tasks. These models combine the predictions of several base models to achieve higher out-of-sample classification accuracy than the base models. Base models are typically trained using different subsets of training examples and input features. Ensemble classifiers are particularly effective when their constituent base models are diverse in terms of their prediction accuracy in different regions of the feature space. This dissertation investigated methods for combining ensemble models, treating them as base models. The goal is to develop a strategy for combining ensemble classifiers that results in higher classification accuracy than the constituent ensemble models. Three of the best performing tree-based ensemble methods, random forest, extremely randomized trees, and the eXtreme gradient boosting model, were used to generate a set of base models. Outputs from classifiers generated by these methods were then combined to create an ensemble classifier. This dissertation systematically investigated methods for (1) selecting a set of diverse base models, and (2) combining the selected base models. The methods were evaluated using public domain data sets which have been extensively used for benchmarking classification models. The research established that applying random forest as the final ensemble method, integrating the selected base models together with factor scores from multiple correspondence analysis, was the best ensemble approach.
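The combination strategy can be prototyped with scikit-learn's stacking API. In the sketch below, GradientBoostingClassifier stands in for XGBoost, and a random forest serves as the final combiner, echoing the dissertation's best-performing setup; the dataset and settings are invented.

    # Sketch: stacking tree-based ensembles with a random forest combiner.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import (RandomForestClassifier,
                                  ExtraTreesClassifier,
                                  GradientBoostingClassifier,
                                  StackingClassifier)
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    base = [("rf", RandomForestClassifier(random_state=0)),
            ("et", ExtraTreesClassifier(random_state=0)),
            ("gb", GradientBoostingClassifier(random_state=0))]
    stack = StackingClassifier(
        estimators=base,
        final_estimator=RandomForestClassifier(random_state=0))
    print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())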
35

Tuulaikhuu, Baigal-Amar. "Influences of toxicants on freshwater biofilms and fish: from experimental approaches to statistical models." Doctoral thesis, Universitat de Girona, 2016. http://hdl.handle.net/10803/392157.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
The main aims of this thesis are: i) to evaluate arsenic toxicity in two major, interacting components of the freshwater ecosystem, biofilm and fish, providing information on environmentally realistic exposures and on biotic interactions, such as nutrient cycling, that modulate toxicity; and ii) to rank predictors of toxicity to fish and quantify the differences in sensitivity among fish species. Our results highlight the interest and applicability of incorporating some of the complexity of natural systems into ecotoxicology, and indicate that the current criterion continuous concentration for arsenic should be updated. We examined the factors that best predict toxicity in a set of widespread fishes using the random forest technique and assessed the importance of differential sensitivity among fish species using analyses of covariance. Our results indicate that caution should be exercised when extrapolating toxicological results, since fish species differ in sensitivity and respond differently to different chemicals.
Los principales objetivos de esta tesis doctoral son: i) evaluar la toxicidad del arsénico en dos elementos clave que interactúan en el ecosistema acuático, biofilm y peces, para proporcionar información sobre los efectos de niveles de contaminación realistas a nivel ambiental y sus interacciones con otros factores que modulan la toxicidad, tales como el reciclado de los nutrientes; y ii) clasificar predictores de toxicidad para los peces y cuantificar las diferencias de sensibilidad entre las especies. Nuestros resultados ponen de manifiesto el interés y la aplicación de la incorporación de algunas de las complejidades de los sistemas naturales en ecotoxicología y destaca que el criterio actual de concentración continua para el arsénico debe ser actualizada. Se examinaron los factores que mejor predicen la toxicidad en un conjunto amplio de peces utilizando la técnica denominada “Random Forests” y se evaluó la importancia de la sensibilidad diferencial entre las especies de peces utilizando el análisis de la covarianza. Nuestro resultado indica que se debe tener precaución al extrapolar los resultados toxicológicos ya que las especies de peces difieren en la sensibilidad y responden de manera diferente a diferentes productos químicos.
36

Dunja, Vrbaški. "Primena mašinskog učenja u problemu nedostajućih podataka pri razvoju prediktivnih modela." Phd thesis, Univerzitet u Novom Sadu, Fakultet tehničkih nauka u Novom Sadu, 2020. https://www.cris.uns.ac.rs/record.jsf?recordId=114270&source=NDLTD&language=en.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Problem nedostajućih podataka je često prisutan prilikom razvoja prediktivnih modela. Umesto uklanjanja podataka koji sadrže vrednosti koje nedostaju mogu se primeniti metode za njihovu imputaciju. Disertacija predlaže metodologiju za pristup analizi uspešnosti imputacija prilikom razvoja prediktivnih modela. Na osnovu iznete metodologije prikazuju se rezultati primene algoritama mašinskog učenja, kao metoda imputacije, prilikom razvoja određenih, konkretnih prediktivnih modela.
The problem of missing data is often present when developing predictive models. Instead of removing data containing missing values, methods for imputation can be applied. The dissertation proposes a methodology for the analysis of imputation performance in the development of predictive models. Based on the proposed methodology, results of the application of machine learning algorithms, as an imputation method in the development of specific models, are presented.
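One concrete way to impute missing values with machine learning before model development is scikit-learn's IterativeImputer with a random-forest estimator per column. A minimal sketch on synthetic data with 10% missingness follows; this is an illustration, not the dissertation's exact method.

    # Sketch: model-based imputation with a random forest per column.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(6)
    X = rng.normal(size=(500, 5))
    X[rng.random(X.shape) < 0.1] = np.nan      # 10% missing at random

    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        max_iter=5, random_state=0)
    X_complete = imputer.fit_transform(X)
    print("remaining NaNs:", np.isnan(X_complete).sum())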
37

Brüls, Maxim. "FAULT DETECTION FOR SMALL-SCALE PHOTOVOLTAIC POWER INSTALLATIONS : A Case Study of a Residential Solar Power System." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-35965.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Fault detection for residential photovoltaic power systems is an often-ignored problem. This thesis introduces a method for detecting power losses due to faults in solar panel performance. Five years of data from a residential system in Dalarna, Sweden, were used to fit a random forest regression that estimates power production. The estimated power was compared to the true power to assess the performance of the power generating system, and faults can be identified from trends in the difference between the two. The model is competent enough to identify consistent energy losses of 10% or more of the expected power output, while requiring only minimal modifications to existing power generating systems.
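A compressed sketch of the residual-monitoring idea on synthetic data: fit a random forest regression from weather features to power, then flag a fault when the one-week average of the actual-to-expected power ratio drops below 0.9. The features, coefficients and window length are invented.

    # Sketch: expected-vs-actual power monitoring for fault detection.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(7)
    n = 24 * 365
    irradiance = rng.gamma(2.0, 150.0, n)
    temperature = rng.normal(15, 8, n)
    power = np.clip(0.8 * irradiance - 0.5 * temperature
                    + rng.normal(0, 20, n), 0, None)

    X = np.column_stack([irradiance, temperature])
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, power)
    expected = model.predict(X)

    ratio = pd.Series(power / np.clip(expected, 1e-6, None))
    sustained = ratio.rolling(window=24 * 7).mean()   # one-week average
    print("fault suspected:", bool((sustained < 0.9).any()))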
38

Chalupa, Daniel. "Rozšiřující modul platformy 3D Slicer pro segmentaci tomografických obrazů." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2017. http://www.nusl.cz/ntk/nusl-316852.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
This work explores machine learning as a tool for the classification of medical images. It contains a literature review covering both classical and modern approaches to image segmentation. The main purpose of this work is to design and implement an extension for the 3D Slicer platform. The extension uses machine learning to classify images according to set parameters. The extension is tested on tomographic images obtained by nuclear magnetic resonance, and the accuracy of the classification and its usability in practice are assessed.
39

Mercadier, Mathieu. "Banking risk indicators, machine learning and one-sided concentration inequalities." Thesis, Limoges, 2020. http://aurore.unilim.fr/theses/nxfile/default/a5bdd121-a1a2-434e-b7f9-598508c52104/blobholder:0/2020LIMO0001.pdf.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Cette thèse de doctorat comprend trois essais portant sur la mise en œuvre, et le cas échéant l'amélioration, de mesures de risques financiers et l'évaluation des risques bancaires, basée sur des méthodes issues de l'apprentissage machine. Le premier chapitre élabore une formule élémentaire, appelée E2C, d'estimation des primes de risque de crédit inspirée de CreditGrades, et en améliore la précision avec un algorithme de forêts d'arbres décisionnels. Nos résultats soulignent le rôle prépondérant tenu par cet estimateur et l'apport additionnel de la notation financière et de la taille de l'entreprise considérée. Le deuxième chapitre infère une version unilatérale de l'inégalité bornant la probabilité d'une variable aléatoire distribuée unimodalement. Nos résultats montrent que l'hypothèse d'unimodalité des rendements d'actions est généralement admissible, nous permettant ainsi d'affiner les bornes de mesures de risques individuels, de commenter les implications pour des multiplicateurs de risques extrêmes, et d'en déduire des versions simplifiées des bornes de mesures de risques systémiques. Le troisième chapitre fournit un outil d'aide à la décision regroupant les banques cotées par niveau de risque en s'appuyant sur une version ajustée de l'algorithme des k-moyennes. Ce processus entièrement automatisé s'appuie sur un très large univers d'indicateurs de risques individuels et systémiques synthétisés en un sous-ensemble de facteurs représentatifs. Les résultats obtenus sont agrégés par pays et région, offrant la possibilité d'étudier des zones de fragilité. Ils soulignent l'importance d'accorder une attention particulière à l'impact ambigu de la taille des banques sur les mesures de risques systémiques
This doctoral thesis is a collection of three essays aiming to implement, and if necessary to improve, financial risk measures and to assess banking risks, using machine learning methods. The first chapter offers an elementary formula inspired by CreditGrades, called E2C, estimating CDS spreads, whose accuracy is improved by a random forest algorithm. Our results emphasize the E2C's key role and the additional contribution of a specific company's debt rating and size. The second chapter infers a one-sided version of the inequality bounding the probability of a unimodal random variable. Our results show that the unimodal assumption for stock returns is generally accepted, allowing us to refine individual risk measures' bounds, to discuss implications for tail risk multipliers, and to infer simple versions of bounds of systemic measures. The third chapter provides a decision support tool clustering listed banks depending on their riskiness using an adjusted version of the k-means algorithm. This entirely automatic process is based on a very large set of stand-alone and systemic risk indicators reduced to representative factors. The obtained results are aggregated per country and region, offering the opportunity to study zones of fragility. They underline the importance of paying a particular attention to the ambiguous impact of banks' size on systemic measures
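As a rough illustration of the clustering pipeline in the third essay, the sketch below standardises synthetic bank risk indicators, reduces them to representative factors, and groups banks with plain k-means; the thesis uses an adjusted k-means variant, and the indicator counts and cluster number here are invented.

    # Sketch: indicators -> factors -> risk clusters of banks.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(8)
    indicators = rng.normal(size=(300, 40))    # 300 banks, 40 indicators

    Z = StandardScaler().fit_transform(indicators)
    factors = PCA(n_components=5, random_state=0).fit_transform(Z)
    labels = KMeans(n_clusters=4, n_init=10,
                    random_state=0).fit_predict(factors)
    print("banks per risk cluster:", np.bincount(labels))

Aggregating the cluster labels by country or region, as the thesis does, then highlights zones of fragility.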
40

Al, Tobi Amjad Mohamed. "Anomaly-based network intrusion detection enhancement by prediction threshold adaptation of binary classification models." Thesis, University of St Andrews, 2018. http://hdl.handle.net/10023/17050.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Network traffic exhibits a high level of variability over short periods of time. This variability impacts negatively on the performance (accuracy) of anomaly-based network Intrusion Detection Systems (IDS) that are built using predictive models in a batch-learning setup. This thesis investigates how adapting the discriminating threshold of model predictions, specifically to the evaluated traffic, improves the detection rates of these Intrusion Detection models. Specifically, this thesis studied the adaptability features of three well known Machine Learning algorithms: C5.0, Random Forest, and Support Vector Machine. The ability of these algorithms to adapt their prediction thresholds was assessed and analysed under different scenarios that simulated real world settings using the prospective sampling approach. A new dataset (STA2018) was generated for this thesis and used for the analysis. This thesis has demonstrated empirically the importance of threshold adaptation in improving the accuracy of detection models when training and evaluation (test) traffic have different statistical properties. Further investigation was undertaken to analyse the effects of feature selection and data balancing processes on a model's accuracy when evaluation traffic with different significant features were used. The effects of threshold adaptation on reducing the accuracy degradation of these models was statistically analysed. The results showed that, of the three compared algorithms, Random Forest was the most adaptable and had the highest detection rates. This thesis then extended the analysis to apply threshold adaptation on sampled traffic subsets, by using different sample sizes, sampling strategies and label error rates. This investigation showed the robustness of the Random Forest algorithm in identifying the best threshold. The Random Forest algorithm only needed a sample that was 0.05% of the original evaluation traffic to identify a discriminating threshold with an overall accuracy rate of nearly 90% of the optimal threshold.
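The threshold-adaptation idea can be sketched as follows: train on one traffic distribution, then re-select the discriminating threshold using only a tiny labelled sample of the statistically different evaluation traffic. The data, drift, and sample size below are invented for illustration.

    # Sketch: adapting a forest's decision threshold to drifted traffic.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(9)
    X_tr = rng.normal(size=(5000, 10))
    y_tr = (X_tr[:, 0] > 0).astype(int)
    X_ev = rng.normal(loc=0.5, size=(5000, 10))     # drifted traffic
    y_ev = (X_ev[:, 0] > 0.5).astype(int)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_ev)[:, 1]

    sample = rng.choice(len(X_ev), size=50, replace=False)  # small sample
    thresholds = np.linspace(0.05, 0.95, 19)
    best = max(thresholds,
               key=lambda t: np.mean((scores[sample] >= t) == y_ev[sample]))
    print("adapted threshold:", best,
          "accuracy:", np.mean((scores >= best) == y_ev))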
41

Victors, Mason Lemoyne. "A Classification Tool for Predictive Data Analysis in Healthcare." BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/5639.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Hidden Markov Models (HMMs) have seen widespread use in a variety of applications ranging from speech recognition to gene prediction. While developed over forty years ago, they remain a standard tool for sequential data analysis. More recently, Latent Dirichlet Allocation (LDA) was developed and soon gained widespread popularity as a powerful topic analysis tool for text corpora. We thoroughly develop LDA and a generalization of HMMs and demonstrate the conjunctive use of both methods in predictive data analysis for health care problems. While these two tools (LDA and HMM) have been used in conjunction previously, we use LDA in a new way to reduce the dimensionality involved in the training of HMMs. With both LDA and our extension of HMM, we train classifiers to predict development of Chronic Kidney Disease (CKD) in the near future.
42

Olaya, Marín Esther Julia. "Ecological models at fish community and species level to support effective river restoration." Doctoral thesis, Universitat Politècnica de València, 2013. http://hdl.handle.net/10251/28853.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
ABSTRACT Native fish are indicators of the health of aquatic ecosystems and have become a key quality element for assessing the ecological status of rivers. Understanding the factors that affect native fish species is important for the management and conservation of aquatic ecosystems. The general objective of this thesis is to analyse the relationships between biological and habitat variables (including connectivity) across a variety of spatial scales in Mediterranean rivers, developing modelling tools to support decision making in river restoration. The thesis consists of four articles. The first aims to model the relationship between a set of environmental variables and native fish species richness (NFSR), and to evaluate the effectiveness of potential restoration actions to improve NFSR in the Júcar river basin. To this end, an artificial neural network (ANN) modelling approach was applied, using the Levenberg-Marquardt algorithm in the training phase. The partial derivatives method was applied to determine the relative importance of the environmental variables. According to the results, the ANN model combines variables describing riparian quality, water quality and physical habitat, and helped identify the main factors conditioning the distribution pattern of NFSR in Mediterranean rivers. In the second part of the study, the model was used to evaluate the effectiveness of two restoration actions in the Júcar river: the removal of two abandoned weirs, with the consequent increase in the proportion of riffles. These simulations indicate that richness increases with the length of river free of artificial barriers and with the proportion of riffle mesohabitat, and demonstrated the usefulness of ANNs as a powerful tool to support decision making in the management and ecological restoration of Mediterranean rivers. The second article aims to determine the relative importance of the two main factors controlling the reduction of fish species richness (NFSR), that is, the interactions between aquatic species, and habitat variables (including river connectivity) and biological variables (including invasive species), in the Júcar, Cabriel and Turia rivers. To this end, three ANN models were analysed: the first built with biological variables only, the second with habitat variables only, and the third with the combination of both groups of variables. The results show that habitat variables are the most important drivers of the distribution of NFSR, and demonstrate the ecological relevance of the developed models. The results of this study highlight the need for mitigation measures related to habitat improvement (including flow variability in the river) as a means to conserve and restore Mediterranean rivers. The third article compares the reliability and ecological relevance of two predictive models of NFSR, based on artificial neural networks (ANN) and random forests (RF). The relevance of the variables selected by each model was assessed on the basis of ecological knowledge, supported by other research.
Both models were developed using k-fold cross-validation and their performance was evaluated through three indices: the coefficient of determination (R²), the mean squared error (MSE) and the adjusted coefficient of determination (R²adj). According to the results, RF achieved the best training performance, but the cross-validation procedure revealed that both techniques produced similar results (R² = 68% for RF and R² = 66% for ANN). Comparing different machine learning methods is very useful for a critical analysis of the results obtained through the models. The fourth article evaluates the ability of ANNs to identify the factors affecting the density and the presence/absence of Luciobarbus guiraonis in the Júcar River Basin District. A multilayer feedforward artificial neural network (ANN) was used to represent nonlinear relationships between descriptors of L. guiraonis and biological and habitat variables. The predictive power of the models was evaluated using the Kappa index (k), the proportion of correctly classified instances (CCI) and the area under the receiver operating characteristic (ROC) curve (AUC). The presence/absence of L. guiraonis was predicted well by the ANN model (CCI = 87%, AUC = 0.85 and k = 0.66); the prediction of density was moderate (CCI = 62%, AUC = 0.71 and k = 0.43). The most important variables describing presence/absence were solar radiation, drainage area and the proportion of exotic fish species, with relative weights of 27.8%, 24.53% and 13.60%, respectively. In the density model, the most important variables were the coefficient of variation of mean annual flows, with a relative importance of 50.5%, and the proportion of exotic fish species, with 24.4%. The models provide important information about the relationship of L. guiraonis with biotic and habitat variables; this new knowledge could support future studies and contribute to decision making for the conservation and management of species in the Júcar, Cabriel and Turia rivers.
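The k-fold comparison of RF and ANN described in the third article can be reproduced in outline. The following is a minimal sketch, assuming scikit-learn and synthetic data in place of the thesis's fish and habitat variables, of how the two regressors might be compared on identical cross-validation folds:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neural_network import MLPRegressor

    # Synthetic stand-in for the NFSR data; the real predictors are not reproduced here
    X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    models = {
        "RF": RandomForestRegressor(n_estimators=500, random_state=0),
        "ANN": MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        print(f"{name}: mean cross-validated R2 = {scores.mean():.2f}")

Scoring both learners on the same folds keeps the comparison fair, mirroring the thesis's finding that an apparent training-set advantage (here, RF's) can shrink under cross-validation.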
Olaya Marín, EJ. (2013). Ecological models at fish community and species level to support effective river restoration [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/28853
43

Jobe, Ndey Isatou. "Nonlinearity In Exchange Rates : Evidence From African Economies." Thesis, Uppsala universitet, Statistiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-297055.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
In an effort to assess the predictive ability of exchange rate models when data on African countries are sampled, this paper studies nonlinear modelling and prediction of the nominal exchange rate series of the United States dollar against the currencies of thirty-eight African states, using the smooth transition autoregressive (STAR) model. A three-step analysis is undertaken. First, nonlinearity in all the nominal exchange rate series is investigated using a chain of credible statistical in-sample tests; significantly, evidence of nonlinear exponential STAR (ESTAR) dynamics is detected across all series. Second, linear models are given another chance: re-estimated on the African data, their predictive power is tested against the tough benchmark of the random walk without drift, and they again fail significantly. Lastly, the predictive ability of nonlinear models against both the random walk without drift and the corresponding linear models is investigated; the nonlinear models display useful forecasting gains over all contending models.
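To make the ESTAR dynamics concrete, here is a minimal sketch of an ESTAR(1) data-generating process with the exponential transition function used in such tests; the parameter values are purely illustrative, not estimates from the thesis:

    import numpy as np

    def estar_transition(z, gamma, c):
        # Exponential transition function, bounded in [0, 1)
        return 1.0 - np.exp(-gamma * (z - c) ** 2)

    rng = np.random.default_rng(0)
    phi_lin, phi_nl, gamma, c = 0.9, -0.5, 2.0, 0.0
    y = np.zeros(500)
    for t in range(1, 500):
        G = estar_transition(y[t - 1], gamma, c)
        y[t] = phi_lin * y[t - 1] + phi_nl * y[t - 1] * G + rng.normal(scale=0.1)

Near the attractor c the series behaves like a persistent AR(1); far from it, the nonlinear term pulls it back. This regime-dependent mean reversion is why ESTAR is a natural candidate for exchange rate series.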
44

Fouemkeu, Norbert. "Modélisation de l’incertitude sur les trajectoires d’avions." Thesis, Lyon 1, 2010. http://www.theses.fr/2010LYO10217/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Dans cette thèse, nous proposons des modèles probabilistes et statistiques d’analyse de données multidimensionnelles pour la prévision de l’incertitude sur les trajectoires d’aéronefs. En supposant que pendant le vol, chaque aéronef suit sa trajectoire 3D contenue dans son plan de vol déposé, nous avons utilisé l’ensemble des caractéristiques de l’environnement des vols comme variables indépendantes pour expliquer l’heure de passage des aéronefs sur les points de leur trajectoire de vol prévue. Ces caractéristiques sont : les conditions météorologiques et atmosphériques, les paramètres courants des vols, les informations contenues dans les plans de vol déposés et la complexité de trafic. Typiquement, la variable dépendante dans cette étude est la différence entre les instants observés pendant le vol et les instants prévus dans les plans de vol pour le passage des aéronefs sur les points de leur trajectoire prévue : c’est la variable écart temporel. En utilisant une technique basée sur le partitionnement récursif d’un échantillon des données, nous avons construit quatre modèles. Le premier modèle que nous avons appelé CART classique est basé sur le principe de la méthode CART de Breiman. Ici, nous utilisons un arbre de régression pour construire une typologie des points des trajectoires des vols en fonction des caractéristiques précédentes et de prévoir les instants de passage des aéronefs sur ces points. Le second modèle appelé CART modifié est une version améliorée du modèle précédent. Ce dernier est construit en remplaçant les prévisions calculées par l’estimation de la moyenne de la variable dépendante dans les nœuds terminaux du modèle CART classique par des nouvelles prévisions données par des régressions multiples à l’intérieur de ces nœuds. Ce nouveau modèle développé en utilisant l’algorithme de sélection et d’élimination des variables explicatives (Stepwise) est parcimonieux. En effet, pour chaque nœud terminal, il permet d’expliquer le temps de vol par des variables indépendantes les plus pertinentes pour ce nœud. Le troisième modèle est fondé sur la méthode MARS, modèle de régression multiple par les splines adaptatives. Outre la continuité de l’estimateur de la variable dépendante, ce modèle permet d’évaluer les effets directs des prédicteurs et de ceux de leurs interactions sur le temps de passage des aéronefs sur les points de leur trajectoire de vol prévue. Le quatrième modèle utilise la méthode d’échantillonnage bootstrap. Il s’agit notamment des forêts aléatoires où pour chaque échantillon bootstrap de l’échantillon de données initial, un modèle d’arbre de régression est construit, et la prévision du modèle général est obtenue par une agrégation des prévisions sur l’ensemble de ces arbres. Malgré le surapprentissage observé sur ce modèle, il est robuste et constitue une solution au problème d’instabilité des arbres de régression propre à la méthode CART. Les modèles ainsi construits ont été évalués et validés en utilisant les données test. Leur application au calcul des prévisions de la charge secteur en nombre d’avions entrants a montré qu’un horizon de prévision d’environ 20 minutes pour une fenêtre de temps supérieure à 20 minutes permettait d’obtenir les prévisions avec des erreurs relatives inférieures à 10%. Parmi ces modèles, CART classique et les forêts aléatoires présentaient de meilleures performances. 
Ainsi, pour l’autorité régulatrice des courants de trafic aérien, ces modèles constituent un outil d’aide pour la régulation et la planification de la charge des secteurs de l’espace aérien contrôlé
In this thesis we propose probabilistic and statistical models, based on multidimensional data analysis, for forecasting the uncertainty on aircraft trajectories. Assuming that during the flight each aircraft follows the 3D trajectory contained in its filed flight plan, we use all the characteristics of the flight environment as predictors to explain the crossing times of aircraft at given points of their planned trajectory. These characteristics are: weather and atmospheric conditions, current flight parameters, information contained in the filed flight plans, and air traffic complexity. The dependent variable in this study is the difference between the actual times observed during the flight and the times planned in the flight plans for crossing the points of the planned trajectory: this variable is called the temporal difference. Using a technique based on recursive partitioning of a data sample, we built four models. The first, called classical CART, is based on Breiman's CART method: a regression tree builds a typology of the points of flight trajectories according to the previous characteristics and predicts the crossing times of aircraft at these points. The second model, called amended CART, improves on the first: the predictions given by the mean of the dependent variable within the terminal nodes of the classical CART are replaced by predictions from multiple regressions fitted inside those nodes. Developed with a stepwise variable-selection algorithm, this new model is parsimonious: for each terminal node it explains the flight time with the predictors most relevant to that node. The third model is based on MARS (multivariate adaptive regression splines). Besides the continuity of the estimator of the dependent variable, this model assesses the direct effects of the predictors, and of their interactions, on the crossing times at the points of the planned trajectory. The fourth model uses bootstrap sampling: it is a random forest, in which a regression tree is built on each bootstrap sample drawn from the initial data and the forecast of the overall model is obtained by aggregating the forecasts of all the trees. Despite the overfitting observed for this model, it is robust and offers a solution to the instability of the regression trees produced by the CART method. The models were assessed and validated on test data. Applying them to forecast sector load, measured as the number of aircraft entering a sector, showed that a forecast horizon of about 20 minutes, with a time window larger than 20 minutes, yields forecasts with relative errors below 10%. Among these models, classical CART and random forests performed best. For the air traffic flow regulation authority, these models can therefore serve as a decision-support tool for regulating and planning the load of controlled airspace sectors
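The "amended CART" idea, replacing each terminal node's mean by a regression fitted within the node, can be sketched as follows. This is a minimal illustration with scikit-learn and synthetic data; the thesis's stepwise selection step inside each node is omitted for brevity:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=1)

    # 1) Grow a regression tree to build the typology of trajectory points
    tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=40).fit(X, y)
    leaf_of = tree.apply(X)  # terminal-node id of each observation

    # 2) Fit a multiple regression inside each terminal node
    node_models = {leaf: LinearRegression().fit(X[leaf_of == leaf], y[leaf_of == leaf])
                   for leaf in np.unique(leaf_of)}

    # 3) Predict: route each point to its leaf, then apply that leaf's regression
    y_hat = np.array([node_models[leaf].predict(x.reshape(1, -1))[0]
                      for x, leaf in zip(X, leaf_of)])

The tree supplies the partition, the per-leaf regressions supply smooth local predictions, which is exactly how the amended model refines classical CART.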
45

Laqrichi, Safae. "Approche pour la construction de modèles d'estimation réaliste de l'effort/coût de projet dans un environnement incertain : application au domaine du développement logiciel." Thesis, Ecole nationale des Mines d'Albi-Carmaux, 2015. http://www.theses.fr/2015EMAC0013/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
L'estimation de l'effort de développement logiciel est l'une des tâches les plus importantes dans le management de projets logiciels. Elle constitue la base pour la planification, le contrôle et la prise de décision. La réalisation d'estimations fiables en phase amont des projets est une activité complexe et difficile du fait, entre autres, d'un manque d'informations sur le projet et son avenir, de changements rapides dans les méthodes et technologies liées au domaine logiciel et d'un manque d'expérience avec des projets similaires. De nombreux modèles d'estimation existent, mais il est difficile d'identifier un modèle performant pour tous les types de projets et applicable à toutes les entreprises (différents niveaux d'expérience, technologies maitrisées et pratiques de management de projet). Globalement, l'ensemble de ces modèles formule l'hypothèse forte que (1) les données collectées sont complètes et suffisantes, (2) les lois reliant les paramètres caractérisant les projets sont parfaitement identifiables et (3) que les informations sur le nouveau projet sont certaines et déterministes. Or, dans la réalité du terrain cela est difficile à assurer. Deux problématiques émergent alors de ces constats : comment sélectionner un modèle d'estimation pour une entreprise spécifique ? et comment conduire une estimation pour un nouveau projet présentant des incertitudes ? Les travaux de cette thèse s'intéressent à répondre à ces questions en proposant une approche générale d'estimation. Cette approche couvre deux phases : une phase de construction du système d'estimation et une phase d'utilisation du système pour l'estimation de nouveaux projets. L'approche s'articule autour de trois processus : 1) évaluation et comparaison fiable de différents modèles d'estimation, et sélection du modèle d'estimation le plus adéquat, 2) construction d'un système d'estimation réaliste à partir du modèle d'estimation sélectionné et 3) utilisation du système d'estimation dans l'estimation d'effort de nouveaux projets caractérisés par des incertitudes. Cette approche intervient comme un outil d'aide à la décision pour les chefs de projets, pour l'estimation réaliste de l'effort, des coûts et des délais de leurs projets logiciels. L'implémentation de l'ensemble des processus et pratiques développés dans le cadre de ces travaux a donné naissance à un prototype informatique open-source. Les résultats de cette thèse s'inscrivent dans le cadre du projet ProjEstimate FUI13
Software effort estimation is one of the most important tasks in the management of software projects. It is the basis for planning, control and decision making. Achieving reliable estimates in the upstream phases of projects is a complex and difficult activity because of, among other things, the lack of information about the project and its future, the rapid changes in methods and technologies in the software field, and the lack of experience with similar projects. Many estimation models exist, but it is difficult to identify one model that performs well for all types of projects and is applicable to all companies (with their different levels of experience, mastered technologies and project management practices). Overall, all of these models make the strong assumptions that (1) the data collected are complete and sufficient, (2) the laws linking the parameters characterizing the projects are fully identifiable, and (3) the information on the new project is certain and deterministic. In practice, however, this is difficult to guarantee. Two problems emerge from these observations: how to select an estimation model for a specific company? And how to produce an estimate for a new project that presents uncertainties? This thesis addresses these questions by proposing a general estimation framework covering two phases: a construction phase for the estimation system and a usage phase for estimating new projects. The framework comprises three processes: 1) reliable evaluation and comparison of different estimation models, and selection of the most suitable one; 2) construction of a realistic estimation system from the selected model; and 3) use of the estimation system to estimate the effort of new projects characterized by uncertainties. This approach serves as a decision-support tool for project managers, supporting realistic estimation of the effort, cost and duration of their software projects. The implementation of all the processes and practices developed in this work has resulted in an open-source software prototype. The results of this thesis are part of the ProjEstimate FUI13 project
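One simple way to make an estimate "realistic" under uncertain inputs, in the spirit of the third process, is to propagate the input uncertainty through the estimation model by Monte Carlo sampling. A minimal sketch, assuming scikit-learn and a purely illustrative project descriptor (the uncertain feature and its distribution are hypothetical):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in for a history of completed projects
    X, y = make_regression(n_samples=200, n_features=4, noise=3.0, random_state=2)
    model = RandomForestRegressor(n_estimators=300, random_state=2).fit(X, y)

    # New project: descriptors mostly known, one uncertain (feature 2, hypothetical)
    rng = np.random.default_rng(2)
    draws = np.tile(X[0], (1000, 1))
    draws[:, 2] = rng.normal(loc=X[0, 2], scale=0.5, size=1000)
    effort = model.predict(draws)
    print(f"effort estimate: {effort.mean():.1f} +/- {effort.std():.1f}")

The decision maker then sees a spread of predicted efforts rather than a single point value.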
46

Ekeberg, Lukas, and Alexander Fahnehjelm. "Maskininlärning som verktyg för att extrahera information om attribut kring bostadsannonser i syfte att maximera försäljningspris." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-240401.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
The Swedish real estate market has been digitalized over the past decade, and current practice is to post real estate advertisements online. A question that arises is how a seller can optimize a public listing to maximize the selling premium. This paper analyzes three machine learning methods for this problem: Linear Regression, Decision Tree Regressor and Random Forest Regressor. The aim is to retrieve information on how certain attributes contribute to the premium. The dataset contains apartments sold during 2014-2018 in the Östermalm / Djurgården district in Stockholm, Sweden. The resulting models returned an R²-value of approx. 0.26 and a Mean Absolute Error of approx. 0.06. While the models did not predict the premium accurately, information could still be extracted from them. In conclusion, a high number of views and a publication made in April provide the best conditions for an advertisement to reach a high selling premium. The seller should try to keep the number of days since publication below 15.5 and avoid publishing on a Tuesday.
Den svenska bostadsmarknaden har blivit alltmer digitaliserad under det senaste årtiondet med nuvarande praxis att säljaren publicerar sin bostadsannons online. En fråga som uppstår är hur en säljare kan optimera sin annons för att maximera budpremie. Denna studie analyserar tre maskininlärningsmetoder för att lösa detta problem: Linear Regression, Decision Tree Regressor och Random Forest Regressor. Syftet är att utvinna information om de signifikanta attribut som påverkar budpremien. Det dataset som använts innehåller lägenheter som såldes under åren 2014-2018 i Stockholmsområdet Östermalm / Djurgården. Modellerna som togs fram uppnådde ett R²-värde på approximativt 0.26 och Mean Absolute Error på approximativt 0.06. Signifikant information kunde extraheras från modellerna trots att de inte var exakta i att förutspå budpremien. Sammanfattningsvis skapar ett stort antal visningar och en publicering i april de bästa förutsättningarna för att uppnå en hög budpremie. Säljaren ska försöka hålla antal dagar sedan publicering under 15.5 dagar och undvika att publicera på tisdagar.
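Extracting attribute information from a fitted model, as the study does, typically goes through feature importances. A minimal sketch with scikit-learn; the feature names are hypothetical stand-ins for the listing attributes, and the data are synthetic:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    names = ["views", "days_since_publication", "month", "weekday"]
    X, y = make_regression(n_samples=500, n_features=4, noise=8.0, random_state=3)
    rf = RandomForestRegressor(n_estimators=400, random_state=3).fit(X, y)
    # Rank attributes by their contribution to the forest's splits
    for name, imp in sorted(zip(names, rf.feature_importances_),
                            key=lambda pair: -pair[1]):
        print(f"{name}: {imp:.2f}")

Even when predictive accuracy is modest, as reported above, such rankings can still point to which attributes matter most.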
47

Romão, Joana Mendonça Vasconcelos. "Modelos para estimar taxas de retenção de clientes : aplicação a uma carteira de seguro automóvel." Master's thesis, Instituto Superior de Economia e Gestão, 2019. http://hdl.handle.net/10400.5/19740.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Mestrado em Actuarial Science
O acesso à informação tem-se tornado cada vez mais fácil. A comparação entre condições tarifárias de diferentes seguradoras é hoje mais frequente, com efeito nas taxas de retenção de clientes e respetivos contratos de seguro. A importância que é dada a este tema é cada vez maior e a construção de ferramentas para estimar as referidas taxas permite tomar medidas para a retenção de negócio rentável e o agravamento dos prémios de contratos menos rentáveis. Este trabalho teve como objetivo estimar a probabilidade de retenção à data de vencimento de uma apólice de seguro, numa carteira do ramo automóvel. Verificado o problema de desequilíbrio entre as classes da variável resposta, a escolha das metodologias a usar baseou-se essencialmente na procura de aumentar a exatidão do modelo final e contornar esse problema.
With increasingly easy access to information, there is a growing concern about customer retention rates. Insurers attach growing importance to having accurate tools for monitoring the policy renewal process, allowing them to retain profitable business and to increase premiums on the less profitable part. The objective of this study was to estimate the probability of renewing a policy in a motor insurance portfolio. Working with an imbalanced data set led us to try several modelling methodologies, all chosen with the aim of increasing the predictive performance of the model.
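For the class-imbalance problem the abstract mentions, one standard device is class weighting. A minimal sketch, assuming scikit-learn and synthetic data rather than the actual portfolio:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    # Roughly 10% non-renewals, mimicking an imbalanced retention portfolio
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=4)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)
    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                 random_state=4).fit(X_tr, y_tr)
    print(balanced_accuracy_score(y_te, clf.predict(X_te)))

Resampling schemes such as SMOTE are a common alternative to weighting; the thesis's exact choice of methodologies is not restated here.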
48

Jouganous, Julien. "Modélisation et simulation de la croissance de métastases pulmonaires." Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0154/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Cette thèse présente des travaux de modélisation mathématique de la croissance tumorale appliqués aux cas de métastases pulmonaires. La première partie de cette thèse décrit un premier modèle d'équations aux dérivées partielles permettant de simuler la croissance métastatique mais aussi la réponse de la tumeur à certains types de traitements. Une méthode de calibration du modèle à partir de données cliniques issues de l'imagerie médicale est développée et testée sur plusieurs cas cliniques. La deuxième partie de ces travaux introduit une simplification du modèle et de l'algorithme de calibration. Cette méthode, plus robuste, est testée sur un panel de 36 cas test et les résultats sont présentés dans le troisième chapitre. La quatrième et dernière partie développe un algorithme d'apprentissage automatisé permettant de tenir compte de données supplémentaires à celles utilisées par le modèle afin d'affiner l'étape de calibration
This thesis deals with the mathematical modeling and simulation of lung metastasis growth. We first present a partial differential equation model to simulate the growth of lung metastases and, possibly, their response to certain types of treatment. This model must be personalized to be used on individual clinical cases; consequently, we developed a calibration technique based on medical images of the tumor. Several applications to clinical cases are presented. We then introduce a simplification of the first model and of the calibration algorithm. This new, more robust method is tested on 36 clinical cases, and the results are presented in the third chapter. Finally, a machine learning algorithm is developed to take into account data beyond those used by the model, in order to refine the calibration step
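Calibration of a growth model from successive scans can be sketched with a much simpler stand-in than the thesis's PDE model: a logistic growth law fitted by least squares. The times and volumes below are synthetic, not clinical data:

    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(t, v0, a, k):
        # Closed-form solution of dV/dt = a V (1 - V/k)
        return k / (1.0 + (k / v0 - 1.0) * np.exp(-a * t))

    t_obs = np.array([0.0, 30.0, 60.0, 90.0])   # days between scans
    v_obs = np.array([1.0, 2.1, 3.9, 6.5])      # tumour volumes in cm3
    (v0, a, k), _ = curve_fit(logistic, t_obs, v_obs,
                              p0=[1.0, 0.05, 20.0], maxfev=10000)
    print(f"growth rate a = {a:.3f}/day, carrying capacity k = {k:.1f} cm3")

The fitted parameters can then drive a forward simulation of future growth, which is the general pattern the thesis follows with its richer model.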
49

Taillardat, Maxime. "Méthodes Non-Paramétriques de Post-Traitement des Prévisions d'Ensemble." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLV072/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
En prévision numérique du temps, les modèles de prévision d'ensemble sont devenus un outil incontournable pour quantifier l'incertitude des prévisions et fournir des prévisions probabilistes. Malheureusement, ces modèles ne sont pas parfaits et une correction simultanée de leur biais et de leur dispersion est nécessaire. Cette thèse présente de nouvelles méthodes de post-traitement statistique des prévisions d'ensemble. Celles-ci ont pour particularité d'être basées sur les forêts aléatoires. Contrairement à la plupart des techniques usuelles, ces méthodes non-paramétriques permettent de prendre en compte la dynamique non-linéaire de l'atmosphère. Elles permettent aussi d'ajouter des covariables (autres variables météorologiques, variables temporelles, géographiques...) facilement et sélectionnent elles-mêmes les prédicteurs les plus utiles dans la régression. De plus, nous ne faisons aucune hypothèse sur la distribution de la variable à traiter. Cette nouvelle approche surpasse les méthodes existantes pour des variables telles que la température et la vitesse du vent. Pour des variables reconnues comme difficiles à calibrer, telles que les précipitations sexti-horaires, des versions hybrides de nos techniques ont été créées. Nous montrons que ces versions hybrides (ainsi que nos versions originales) sont meilleures que les méthodes existantes. Elles amènent notamment une véritable valeur ajoutée pour les pluies extrêmes. La dernière partie de cette thèse concerne l'évaluation des prévisions d'ensemble pour les événements extrêmes. Nous avons montré quelques propriétés concernant le Continuous Ranked Probability Score (CRPS) pour les valeurs extrêmes. Nous avons aussi défini une nouvelle mesure combinant le CRPS et la théorie des valeurs extrêmes, dont nous examinons la cohérence sur une simulation ainsi que dans un cadre opérationnel. Les résultats de ce travail sont destinés à être insérés au sein de la chaîne de prévision et de vérification à Météo-France
In numerical weather prediction, ensemble forecast systems have become an essential tool to quantify forecast uncertainty and to provide probabilistic forecasts. Unfortunately, these models are not perfect and a simultaneous correction of their bias and their dispersion is needed. This thesis presents new statistical post-processing methods for ensemble forecasting. These are based on random forest algorithms, which are non-parametric. Contrary to state-of-the-art procedures, random forests can take into account non-linear features of atmospheric states. They easily allow the addition of covariables (such as other weather variables, seasonal or geographic predictors) through a self-selection of the most useful predictors for the regression. Moreover, we make no assumption on the distribution of the variable of interest. This new approach outperforms the existing methods for variables such as surface temperature and wind speed. For variables well known to be tricky to calibrate, such as six-hour accumulated rainfall, hybrid versions of our techniques have been created. We show that these versions (and our original methods) are better than existing ones; in particular, they provide added value for extreme precipitation. The last part of this thesis deals with the verification of ensemble forecasts for extreme events. We show several properties of the Continuous Ranked Probability Score (CRPS) for extreme values. We also define a new index combining the CRPS and extreme value theory, whose consistency is investigated on both simulations and real cases. The contributions of this work are intended to be inserted into the forecasting and verification chain at Météo-France
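The CRPS studied in the last part has a simple sample estimator for a finite ensemble (the kernel form). A minimal sketch with illustrative numbers:

    import numpy as np

    def crps_ensemble(members, obs):
        # CRPS = E|X - y| - 0.5 E|X - X'|, estimated over ensemble members
        members = np.asarray(members, dtype=float)
        term1 = np.mean(np.abs(members - obs))
        term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
        return term1 - term2

    # Five ensemble members for, say, a 2 m temperature forecast, and the observation
    print(crps_ensemble([12.1, 13.4, 11.8, 12.9, 14.0], obs=12.5))

Lower is better: the first term penalizes distance from the observation while the second balances it against ensemble spread, which is why the CRPS is a natural target for the joint bias-and-dispersion correction described above.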
50

Le, Faou Yohann. "Contributions à la modélisation des données de durée en présence de censure : application à l'étude des résiliations de contrats d'assurance santé." Thesis, Sorbonne université, 2019. http://www.theses.fr/2019SORUS527.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Dans cette thèse, nous nous intéressons aux modèles de durée dans le contexte de la modélisation des durées de résiliation de contrats d'assurance santé. Identifié dès le 17ème siècle avec les études de Graunt J. (1662) sur la mortalité, le biais induit par la censure des données de durée observées dans ce contexte doit être corrigé par les modèles statistiques utilisés. À travers la problématique de la mesure de la dépendance entre deux durées successives, et la problématique de la prédiction de la durée de résiliation d'un contrat d'assurance, nous étudions les propriétés théoriques et pratiques de différents estimateurs basés sur une méthode de pondération des observations (méthode dite IPCW) visant à corriger ce biais. L'application de ces méthodes à l'estimation de la valeur client en assurance est également détaillée
In this thesis, we study duration models in the context of the analysis of contract termination times in health insurance. Identified as early as the 17th century, in the work of Graunt J. (1662) on mortality, the bias induced by the censoring of the duration data observed in this context must be corrected by the statistical models used. Through the problem of measuring the dependence between two successive durations, and the problem of predicting the termination time of an insurance contract, we study the theoretical and practical properties of different estimators that rely on a weighting of the observations (the so-called IPCW method) designed to compensate for this bias. The application of these methods to customer value estimation in insurance is also discussed in detail
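A minimal sketch of the IPCW weighting: uncensored observations are reweighted by the inverse of the Kaplan-Meier estimate of the censoring survival function. It assumes the lifelines package and synthetic data, not the thesis's insurance portfolio:

    import numpy as np
    from lifelines import KaplanMeierFitter

    rng = np.random.default_rng(5)
    t_event = rng.exponential(12.0, size=500)   # true termination times
    t_cens = rng.exponential(20.0, size=500)    # censoring times
    t_obs = np.minimum(t_event, t_cens)
    uncensored = t_event <= t_cens

    # Kaplan-Meier of the censoring distribution: flip the event indicator
    km_c = KaplanMeierFitter().fit(t_obs, event_observed=~uncensored)
    g_hat = km_c.survival_function_at_times(t_obs).to_numpy()
    weights = np.where(uncensored, 1.0 / np.clip(g_hat, 1e-8, None), 0.0)
    # 'weights' can now enter any weighted regression or loss; this reweighting
    # is how the censoring-induced bias is compensated in IPCW-type estimators.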
